
Core Flow Concepts

Data

Data in Flow represents any file used or generated during bioinformatics analysis. From raw sequencing reads to final publication figures, Flow tracks, organizes, and preserves all data while maintaining complete provenance and ensuring reproducibility.


What is Data?

In Flow, "data" encompasses:

  1. Input files: Raw sequencing data, reference genomes, annotation files
  2. Intermediate files: Aligned reads, normalized counts, quality reports
  3. Output files: Analysis results, visualizations, summary statistics
  4. Metadata files: Sample sheets, configuration files, logs
  5. Associated files: Publications, protocols, supplementary materials

Every piece of data in Flow is:

  • Tracked: With checksums, sizes, and creation timestamps
  • Owned: By a specific user or group
  • Versioned: Changes create new versions, preserving originals
  • Accessible: Through permissions and sharing controls
  • Preserved: With backups and long-term storage options

Data Types

Sequencing Data

FASTQ Files

  • Raw sequencing reads
  • Single-end or paired-end
  • Compressed formats (.gz, .bz2)
  • Quality scores included

Example FASTQ record (four lines: header, sequence, separator, quality):

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
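The fourth line encodes per-base Phred quality scores as ASCII characters, almost universally with the Phred+33 offset in modern FASTQ files. A minimal sketch of decoding such a string:

```python
def phred33_to_scores(quality_string):
    """Decode a Phred+33 quality string into per-base integer scores.

    Each character's ASCII code minus 33 gives the Phred quality Q,
    where Q = -10 * log10(probability the base call is wrong).
    """
    return [ord(ch) - 33 for ch in quality_string]

# '!' is the lowest possible score (Q0); 'I' corresponds to Q40.
scores = phred33_to_scores("!''*((((")
```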

BAM/SAM Files

  • Aligned sequencing reads
  • Binary (BAM) or text (SAM) format
  • Contains alignment information
  • Indexed for fast access

CRAM Files

  • Compressed alignment format
  • Reference-based compression
  • Smaller than BAM files
  • Requires reference genome

Analysis Results

Count Matrices

  • Gene expression values
  • Peak counts
  • Feature quantification
  • Usually tab-delimited
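A tab-delimited count matrix typically has feature IDs (genes, peaks) in the first column and one integer column per sample. A sketch of parsing one with only the standard library (the gene and sample names below are illustrative):

```python
import csv
import io

def read_count_matrix(handle):
    """Parse a tab-delimited count matrix: first column is the feature ID,
    remaining columns are per-sample integer counts."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    counts = {row[0]: [int(v) for v in row[1:]] for row in reader}
    return samples, counts

# Hypothetical two-sample matrix:
tsv = "gene_id\tsample_001\tsample_002\nGAPDH\t1523\t1401\nACTB\t2210\t2050\n"
samples, counts = read_count_matrix(io.StringIO(tsv))
```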

Differential Analysis

  • Statistical comparisons
  • Fold changes and p-values
  • Gene lists
  • Pathway enrichments

Visualizations

  • Quality control plots
  • Heatmaps and clustering
  • PCA/t-SNE plots
  • Publication figures

Reference Data

Genome Files

  • FASTA format sequences
  • Chromosome assemblies
  • Indexed for alignment
  • Species-specific

Annotation Files

  • GTF/GFF gene models
  • BED format regions
  • Gene ontology data
  • Pathway databases

Metadata Files

Sample Sheets

  • Experimental design
  • Sample-to-file mapping
  • Batch information
  • Clinical data

Configuration Files

  • Pipeline parameters
  • Tool settings
  • Resource allocations
  • Custom options

Data Lifecycle

1. Upload Phase

Data enters Flow through several routes:

Direct Upload

  • Web interface drag-and-drop
  • Progress tracking
  • Automatic validation
  • Batch uploading

API Upload

  • Programmatic transfer
  • Integration with LIMS
  • Automated workflows
  • Bulk operations

External Import

  • From sequencing facilities
  • Cloud storage (S3, GCS)
  • FTP/SFTP servers
  • Public databases

2. Validation Phase

Upon upload, Flow validates:

  • File integrity: Checksums match
  • Format compliance: Valid file structure
  • Completeness: All expected files present
  • Quality metrics: Basic QC checks
  • Metadata: Required fields populated

3. Storage Phase

Data is stored with:

  • Redundancy: Multiple copies for safety
  • Compression: Space-efficient storage
  • Organization: Logical directory structure
  • Indexing: Fast retrieval
  • Encryption: Security at rest

4. Processing Phase

During analysis:

  • Access control: Permission verification
  • Staging: Data prepared for compute
  • Monitoring: Track usage and progress
  • Temporary files: Managed workspace
  • Result capture: Automatic collection

5. Archive Phase

Long-term preservation:

  • Cold storage: Cost-effective archival
  • Metadata preservation: Complete context
  • Retrieval options: On-demand access
  • Compliance: Regulatory requirements
  • Retention policies: Automated management

Data Organization

Hierarchical Structure

Flow organizes data hierarchically:

Project/
├── Samples/
│   ├── Sample_001/
│   │   ├── raw_data/
│   │   │   ├── sample_001_R1.fastq.gz
│   │   │   └── sample_001_R2.fastq.gz
│   │   └── executions/
│   │       ├── rnaseq_run_001/
│   │       │   ├── alignments/
│   │       │   ├── counts/
│   │       │   └── qc_reports/
│   │       └── rnaseq_run_002/
│   └── Sample_002/
└── Shared_Data/
    ├── reference_genome/
    └── annotation_files/

Naming Conventions

Automatic Naming

  • System-generated unique IDs
  • Timestamp-based versioning
  • Pipeline-specific patterns
  • Collision avoidance

Custom Naming

  • User-defined schemes
  • Metadata-based patterns
  • Batch renaming tools
  • Export formatting

Data Relationships

Parent-Child

  • Raw data → Processed data
  • Sample → Analysis results
  • Pipeline → Output files

Versioning

  • Original → Modified versions
  • Parameter changes → New outputs
  • Re-analysis → Updated results

Cross-References

  • Multiple samples → Combined analysis
  • Shared references → Many analyses
  • Integrated results → Multi-omics

Data Access

Permissions Model

Access Levels:

  1. None: No access to data
  2. View: Can see and download
  3. Use: Can use in analyses
  4. Modify: Can update metadata
  5. Delete: Can remove data

Inheritance:

  • Project permissions flow to data
  • Sample permissions apply to results
  • Explicit permissions override inherited
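The inheritance rules above can be sketched as a small resolver: a data object takes its project's level unless an explicit per-data grant exists. The level names follow the list above; the function itself is illustrative, not Flow's actual API:

```python
# Access levels from weakest to strongest, per the list above.
LEVELS = ["none", "view", "use", "modify", "delete"]

def effective_permission(user, data_grants, project_grants):
    """Resolve a user's effective access level on a data object.

    Explicit per-data grants override inherited project grants;
    otherwise the project-level grant flows down to the data.
    """
    if user in data_grants:                  # explicit override
        return data_grants[user]
    return project_grants.get(user, "none")  # inherited, else no access

# Example: alice inherits 'use' from the project; bob's explicit 'view'
# grant on the data overrides his project-level 'modify'.
project = {"alice": "use", "bob": "modify"}
data = {"bob": "view"}
```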

Sharing Mechanisms

Direct Sharing

  • Grant specific user access
  • Set expiration dates
  • Track access history
  • Revoke when needed

Group Sharing

  • Share with lab groups
  • Institutional access
  • Collaborative projects
  • Managed centrally

Public Sharing

  • Make data publicly accessible
  • Generate stable URLs
  • DOI assignment
  • Citation tracking

Data Transfer

Download Options

  • Individual file download
  • Bulk download with manifest
  • Compressed archives
  • Cloud-to-cloud transfer

Streaming Access

  • View without downloading
  • Partial file access
  • Real-time processing
  • Bandwidth optimization

Data Quality

Quality Metrics

Sequencing Quality

  • Per-base quality scores
  • Read length distribution
  • Adapter contamination
  • Duplication rates
  • GC content
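Per-read metrics such as GC content and mean quality fall out directly from a FASTQ record. A minimal sketch:

```python
def gc_content(sequence):
    """Fraction of G and C bases in a read (0.0 to 1.0)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    return gc / len(seq) if seq else 0.0

def mean_quality(quality_string):
    """Mean Phred score of a Phred+33-encoded quality string."""
    scores = [ord(ch) - 33 for ch in quality_string]
    return sum(scores) / len(scores) if scores else 0.0
```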

Alignment Quality

  • Mapping rates
  • Insert size distribution
  • Coverage uniformity
  • Mapping quality scores

Analysis Quality

  • Statistical confidence
  • Technical replicates
  • Batch effects
  • Outlier detection

Quality Control

Automatic QC

  • FastQC on uploads
  • MultiQC aggregation
  • Pipeline-specific checks
  • Threshold alerts
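Threshold alerts amount to comparing computed metrics against configured cutoffs. A minimal sketch (the metric names and cutoff values are illustrative; real pipelines set these per assay):

```python
# Illustrative QC cutoffs -- not Flow's actual defaults.
THRESHOLDS = {"mapping_rate": 0.70, "mean_quality": 30.0}

def qc_alerts(metrics, thresholds=THRESHOLDS):
    """Return the metric names that fall below their configured cutoff."""
    return sorted(name for name, cutoff in thresholds.items()
                  if metrics.get(name, 0.0) < cutoff)

# A sample with a low mapping rate triggers a single alert:
alerts = qc_alerts({"mapping_rate": 0.55, "mean_quality": 34.2})
```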

Manual Review

  • Interactive reports
  • Custom visualizations
  • Comparative analysis
  • Expert annotation

Data Formats

Supported Formats

Sequencing

  • FASTQ (.fastq, .fq, .fastq.gz)
  • FASTA (.fasta, .fa, .fasta.gz)
  • SAM/BAM/CRAM (.sam, .bam, .cram)

Genomic Regions

  • BED (.bed, .bed.gz)
  • GFF/GTF (.gff, .gtf, .gff3)
  • VCF (.vcf, .vcf.gz)
  • BigWig/BigBed (.bw, .bb)

Tabular Data

  • CSV/TSV (.csv, .tsv, .txt)
  • Excel (.xlsx, .xls)
  • HDF5 (.h5, .hdf5)
  • Parquet (.parquet)

Images

  • PNG/JPEG (.png, .jpg, .jpeg)
  • SVG (.svg)
  • PDF (.pdf)
  • TIFF (.tif, .tiff)

Format Conversion

Flow provides automatic conversion between:

  • FASTQ ↔ FASTA
  • SAM ↔ BAM
  • GFF ↔ GTF
  • Various compression formats
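Of these, FASTQ → FASTA is the simplest direction, since it only drops the quality information (the reverse has to invent placeholder qualities). A sketch of the forward conversion:

```python
def fastq_to_fasta(fastq_lines):
    """Convert FASTQ records (4 lines each) to FASTA (2 lines each).

    The '@' header becomes a '>' header; the '+' separator line and
    the quality line are discarded.
    """
    fasta = []
    for i in range(0, len(fastq_lines), 4):
        header, sequence = fastq_lines[i], fastq_lines[i + 1]
        fasta.append(">" + header.lstrip("@"))
        fasta.append(sequence)
    return fasta

record = ["@SEQ_ID", "GATTTGGGG", "+", "!''*((((*"]
```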

Best Practices

Data Upload

  1. Verify before upload

    • Check file integrity
    • Confirm correct files
    • Review file sizes
    • Test with small batch
  2. Organize locally first

    • Clear naming scheme
    • Logical folder structure
    • Complete metadata
    • Remove duplicates
  3. Use appropriate methods

    • Web for small files
    • API for automation
    • Bulk upload for many files
    • Direct transfer for large data

Data Management

  1. Maintain metadata

    • Document thoroughly
    • Update regularly
    • Use controlled vocabularies
    • Link related data
  2. Monitor usage

    • Track storage costs
    • Review access logs
    • Clean up temporary files
    • Archive completed projects
  3. Plan for growth

    • Estimate storage needs
    • Budget for computation
    • Design scalable organization
    • Implement retention policies

Data Sharing

  1. Set permissions carefully

    • Minimum necessary access
    • Time-limited shares
    • Review regularly
    • Document decisions
  2. Prepare for publication

    • Clean up file names
    • Complete all metadata
    • Generate DOIs
    • Create data packages

Advanced Features

Data Provenance

Complete tracking of data lineage:

  • Origin and creation time
  • All processing steps
  • Parameter settings used
  • Software versions
  • User actions

Data Integration

Combine data from multiple sources:

  • Multi-omics integration
  • Cross-project analysis
  • External database links
  • Federated queries

Automated Workflows

Trigger actions on data events:

  • Auto-process on upload
  • Quality check pipelines
  • Notification rules
  • Archival policies

Data APIs

Programmatic access to all data:

  • RESTful endpoints
  • GraphQL queries
  • Streaming protocols
  • Bulk operations
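A RESTful fetch of a data record might look like the sketch below. The endpoint path, host, and header names are hypothetical, invented for illustration only; consult Flow's API reference for the real interface. The request is constructed but not sent:

```python
import urllib.request

def build_data_request(base_url, data_id, token):
    """Construct (but do not send) an authenticated GET for a data record.

    The /api/data/{id} path and bearer-token auth are assumptions,
    not Flow's documented API.
    """
    return urllib.request.Request(
        url=f"{base_url}/api/data/{data_id}",
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/json"},
        method="GET",
    )

req = build_data_request("https://flow.example.org", "12345", "MY_TOKEN")
# urllib.request.urlopen(req) would perform the actual call.
```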

Troubleshooting

Common Issues

Upload failures?

  • Check network stability
  • Verify file permissions
  • Ensure sufficient space
  • Try smaller chunks

Can't access data?

  • Verify permissions
  • Check project membership
  • Confirm data location
  • Contact data owner

Format not recognized?

  • Verify file extension
  • Check file contents
  • Try manual format specification
  • Convert if necessary

Performance issues?

  • Use appropriate access method
  • Consider data locality
  • Optimize query patterns
  • Request increased resources

Next Steps

For a comprehensive overview of Flow's data concepts, see the Core Concepts guide.
