File formats

An essential aspect of bioinformatics is the use of standard file formats to store, share, and manipulate sequence and annotation data. This article covers some of the most common bioinformatics file formats, including FASTA, FASTQ, BAM, BED, BEDGraph, and BigWig, with short examples of each.

FASTA

FASTA is a widely used text-based format for representing DNA, RNA, or protein sequences. Each record begins with a single-line description starting with a '>' symbol, followed by the sequence itself on one or more lines.

Example:

>Sample_DNA_sequence
ATGCAGTAGCTAGCTAGCTAGCTAGCTAGCTAG
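
The record structure described above can be read with a few lines of Python. This is a minimal sketch rather than a full parser: the helper name parse_fasta is hypothetical, and it assumes that description lines are unique and that a sequence may be wrapped over several lines.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence} for FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            records[header] += line  # a sequence may span several lines
    return records


fasta_text = """>Sample_DNA_sequence
ATGCAGTAGCTAGCTAGCTAGCTAGCTAGCTAG
"""
records = parse_fasta(fasta_text.splitlines())
for description, sequence in records.items():
    print(description, len(sequence))
```

For large files, an established library such as Biopython is usually a better choice than a hand-written parser.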

FASTQ

FASTQ extends the FASTA format to store both sequence data and per-base quality scores. Each entry consists of four lines: a description line (starting with '@'), the sequence, a separator line (starting with '+'), and the quality scores, encoded as one ASCII character per base.

Example:

@Sample_DNA_sequence
ATGCAGTAGCTAGCTAGCTAGCTAGCTAGCTAG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
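
The quality line can be turned back into numeric scores by subtracting an offset from each character code. The sketch below assumes the Phred+33 encoding (used by Sanger and recent Illumina data); the helper name parse_fastq and the short example record are made up for illustration. Under that assumption, the character 'I' corresponds to a score of 40.

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, phred_scores) from four-line FASTQ records."""
    for i in range(0, len(lines) - 3, 4):
        name_line, seq, sep, qual = (s.strip() for s in lines[i:i + 4])
        if not name_line.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Each quality character encodes one Phred score.
        scores = [ord(ch) - PHRED_OFFSET for ch in qual]
        yield name_line[1:], seq, scores


# Hypothetical record, chosen so the scores are easy to check by hand.
fastq_lines = ["@example_read", "ACGTACGT", "+", "IIII!!5?"]
fastq_records = list(parse_fastq(fastq_lines))
name, seq, scores = fastq_records[0]
print(name, seq, scores)

# A Phred score Q corresponds to an error probability of 10^(-Q/10).
error_probabilities = [10 ** (-q / 10) for q in scores]
```

A score of 40 therefore means an estimated 1 in 10,000 chance that the base call is wrong.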

BAM (Binary Alignment/Map) and SAM (Sequence Alignment/Map)

BAM is a binary format used for storing sequence alignments. It is the compressed binary version of the Sequence Alignment/Map (SAM) format. The compression, together with an optional index, allows efficient storage and retrieval of large-scale alignment data. BAM files are commonly processed with genomic analysis tools such as the Genome Analysis Toolkit (GATK) or SAMtools.

The SAM format is a text-based format for storing sequence alignment data, including reads mapped to a reference genome. SAM files contain a header section with metadata and a subsequent section containing individual alignment records, one per line.

Each alignment record in a SAM file consists of tab-separated columns containing information about the read and its alignment. The 11 mandatory columns, in order, are:

QNAME: Query template name (i.e., read identifier)
FLAG: Bitwise flag representing various properties of the alignment (e.g., paired-end, properly paired, mapped, etc.)
RNAME: Reference sequence name (i.e., chromosome or contig)
POS: 1-based leftmost mapping position of the read on the reference
MAPQ: Mapping quality (Phred-scaled probability that the mapping position is wrong; 255 means the value is unavailable)
CIGAR: CIGAR string representing the alignment, with operations like match (M), insertion (I), deletion (D), etc.
RNEXT: Reference name of the mate/next read
PNEXT: Position of the mate/next read
TLEN: Observed template length
SEQ: Read sequence
QUAL: Read quality scores (Phred-scaled)
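
Two of these columns are compact encodings that are easier to follow in code. The sketch below decodes a FLAG value into the bits defined by the SAM specification and converts a MAPQ value into an error probability. The function names and the short labels in FLAG_BITS are hypothetical, chosen here for readability.

```python
from typing import List

# Bit values follow the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """List the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def mapq_to_error_probability(mapq: int) -> float:
    """Convert a Phred-scaled MAPQ into the probability the mapping is wrong."""
    return 10 ** (-mapq / 10)


print(decode_flag(99))
print(mapq_to_error_probability(60))
```

A FLAG of 99 is the sum 1 + 2 + 32 + 64, so it decodes to a paired read in a proper pair, with its mate on the reverse strand, that is first in its pair.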

Example:

@HD VN:1.6  SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@PG ID:bowtie2  PN:bowtie2  VN:2.3.4.1  CL:"bowtie2-align-s --wrapper basic-0 -x reference -S output.sam -1 read1.fastq -2 read2.fastq"

read_1234   99   chr1   10000   60   100M  =   10200   300   AGCTTAGCTAGCTAGCTAGCTTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG *
read_1234   147  chr1   10200   60   100M  =   10000   -300  TAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC *

In this example, the SAM file begins with a header section containing metadata about the file, including the reference sequence names and lengths, as well as the software used for alignment. The alignment records follow, with each line representing an individual read alignment. The '*' in the QUAL column indicates that quality scores are not stored.
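
As a rough illustration of this layout, the following sketch separates header lines from alignment records and uses the CIGAR string to work out where an alignment ends on the reference. The helper names split_sam and reference_span and the short example record are hypothetical; real SAM files are tab-separated and may carry optional tags after the 11 mandatory columns, which this sketch ignores.

```python
import re
from typing import Any, Dict, List, Tuple

SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_COLUMNS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def split_sam(lines: List[str]) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Separate '@' header lines from alignment records."""
    header: List[str] = []
    records: List[Dict[str, Any]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)
            continue
        # Keep only the 11 mandatory columns; optional tags are dropped.
        record: Dict[str, Any] = dict(zip(SAM_COLUMNS, line.split("\t")[:11]))
        for key in INT_COLUMNS:
            record[key] = int(record[key])
        records.append(record)
    return header, records


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


# Hypothetical single-record file with a short read for easy checking.
sam_text = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "r1\t99\tchr1\t10000\t60\t6M2I2M1D4M\t=\t10200\t300\tACGTACGTACGTAC\t*\n"
)
header, sam_records = split_sam(sam_text.splitlines())
rec = sam_records[0]
end_pos = rec["POS"] + reference_span(rec["CIGAR"]) - 1  # 1-based, inclusive
print(rec["QNAME"], rec["RNAME"], rec["POS"], end_pos)
```

Insertions (I) consume read bases but not reference bases, while deletions (D) do the opposite, which is why the 14-base read in this sketch covers 13 reference positions. In practice, libraries such as pysam handle both SAM and BAM and are preferable to manual parsing.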

BED (Browser Extensible Data)

The BED format is a simple, tab-delimited text format for representing genomic features, such as genes, exons, or regulatory elements, along with their annotations. Each line describes one feature and contains at least three columns (chromosome, start, and end), optionally followed by columns for feature name, score, strand, and so on. Start coordinates are 0-based and end coordinates are exclusive, so the length of a feature is simply end minus start.

Example:

chr1    1000    2000    Feature1    0    +
chr1    3000    4000    Feature2    0    -
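
A short sketch of reading these lines in Python follows. The BedFeature type and the parse_bed_line helper are hypothetical names, only the first six columns are handled, and fields are split on any whitespace, which assumes feature names contain no spaces. BED starts are 0-based and ends are exclusive, so the feature length is end minus start.

```python
from typing import List, NamedTuple, Optional


class BedFeature(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive
    name: Optional[str] = None
    score: Optional[str] = None
    strand: Optional[str] = None


def parse_bed_line(line: str) -> BedFeature:
    """Parse the three required and up to three optional BED columns."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, and end")
    optional = fields[3:6]  # name, score, strand when present
    return BedFeature(fields[0], int(fields[1]), int(fields[2]), *optional)


bed_text = """chr1    1000    2000    Feature1    0    +
chr1    3000    4000    Feature2    0    -
"""
features: List[BedFeature] = [
    parse_bed_line(line) for line in bed_text.splitlines() if line.strip()
]
lengths = [f.end - f.start for f in features]
for feature, length in zip(features, lengths):
    print(feature.name, feature.strand, length)
```

Because of the 0-based start, Feature1 covers the 1-based positions 1001 through 2000, which is 1000 bases.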

BEDGraph

The BEDGraph format is a variant of the BED format used for displaying continuous data, such as read coverage or signal intensity, in a genome browser. Like BED files, BEDGraph files are tab-delimited text files, but each line has exactly four columns: chromosome, start, end, and a numeric data value for that region. The format itself has no strand column. Some workflows use negative values by convention to display reverse-strand signal below the axis and positive values for the forward strand, but this is a convention of the workflow, not part of the format.

Example:

chr1    1000    2000    1.5
chr1    2000    3000    -2.0
chr1    3000    4000    3.5
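
One common operation on such data is summarizing the signal over a region. The sketch below parses the four columns and computes a mean weighted by interval length. The function names are hypothetical, and skipping lines that start with "track", "browser", or "#" is an assumption about how such files are usually laid out for genome browsers.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int, float]  # chrom, start, end, value


def parse_bedgraph(lines: Iterable[str]) -> List[Interval]:
    """Parse four-column BEDGraph lines, skipping track and comment lines."""
    intervals: List[Interval] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        intervals.append((chrom, int(start), int(end), float(value)))
    return intervals


def weighted_mean(intervals: List[Interval]) -> float:
    """Mean signal per base, weighting each interval by its length."""
    total_bases = sum(end - start for _, start, end, _ in intervals)
    weighted_sum = sum((end - start) * value for _, start, end, value in intervals)
    return weighted_sum / total_bases


bedgraph_text = """chr1    1000    2000    1.5
chr1    2000    3000    -2.0
chr1    3000    4000    3.5
"""
intervals = parse_bedgraph(bedgraph_text.splitlines())
mean_signal = weighted_mean(intervals)
print(mean_signal)
```

For the three intervals above, each 1000 bases long, the weighted mean is (1.5 - 2.0 + 3.5) / 3 = 1.0.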