Bioinformatics Concepts
File formats
An essential aspect of bioinformatics is the use of standard file formats to store, share, and manipulate sequence, alignment, and annotation data. In this article, we will explore some of the most common bioinformatics file formats, including FASTA, FASTQ, SAM/BAM, BED, BEDGraph, and BigWig, with short examples of each.
FASTA
FASTA is a widely used text-based format for representing DNA, RNA, or protein sequences. Each record begins with a single-line description starting with a '>' symbol, followed by the sequence itself, which may be written on one line or wrapped across several.
Example:
>Sample_DNA_sequence
ATGCAGTAGCTAGCTAGCTAGCTAGCTAGCTAG
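To make the record structure concrete, here is a minimal Python sketch of a FASTA reader. The function name `parse_fasta` is hypothetical, and the sketch assumes the whole file fits in memory; for real work a maintained library is usually a better choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {description: sequence} for FASTA text given as lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]  # drop the leading '>'
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[header] += line
    return records


example = """>Sample_DNA_sequence
ATGCAGTAGCTAGCTAGCTAGCTAGCTAGCTAG
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, seq)
```

Keying the dictionary on the full description line is a simplification; many tools use only the text up to the first space as the sequence identifier.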
FASTQ
FASTQ is an extension of the FASTA format, designed for storing both sequence data and per-base quality scores. Each entry consists of four lines: a description line (starting with '@'), the sequence, a separator line (starting with '+'), and the quality scores, encoded as one ASCII character per base so that the quality line has the same length as the sequence.
Example:
@Sample_DNA_sequence
ATGCAGTAGCTAGCTAGCTAGCTAGCTAGCTAG
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
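The quality line is easier to interpret once decoded. The sketch below reads four-line FASTQ records and converts each quality character to a Phred score, assuming the common Phred+33 encoding (older Illumina pipelines used an offset of 64). The function names and the short record used here are made up for illustration.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (name, sequence, phred_scores) from four-line FASTQ records."""
    stripped = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(stripped) - 3, 4):
        name_line, seq, sep, qual = stripped[i:i + 4]
        if not name_line.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at non-blank line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # Phred+33: subtract 33 from each character's ASCII code.
        scores = [ord(ch) - 33 for ch in qual]
        yield name_line[1:], seq, scores


def error_probability(phred: int) -> float:
    """Convert a Phred score Q to an error probability, 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


# Hypothetical four-base read: 'I' decodes to 40, '5' to 20, '!' to 0.
fastq_records = list(parse_fastq(["@read1", "ACGT", "+", "II5!"]))
for name, seq, scores in fastq_records:
    print(name, seq, scores)
```

A score of 40, the value encoded by 'I', corresponds to an estimated error probability of about 1 in 10,000 for that base.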
BAM (Binary Alignment/Map) and SAM (Sequence Alignment/Map)
BAM is a binary format used for storing sequence alignments. It is the compressed binary version of the Sequence Alignment/Map (SAM) format. Because BAM files are compressed and can be indexed, they allow efficient storage of, and random access to, large alignment datasets. BAM files are commonly read and written by genomic analysis tools such as SAMtools and the Genome Analysis Toolkit (GATK).
The SAM format is a text-based format for storing sequence alignment data, including reads mapped to a reference genome. SAM files contain a header section with metadata and a subsequent section containing individual alignment records, one per line.
Each alignment record in a SAM file consists of tab-separated columns containing information about the read and its alignment. The first 11 columns are mandatory and always appear in this order; optional tagged fields may follow them:
- QNAME: Query template name (i.e., read identifier)
- FLAG: Bitwise flag representing various properties of the alignment (e.g., paired-end, properly paired, mapped, etc.)
- RNAME: Reference sequence name (i.e., chromosome or contig)
- POS: 1-based leftmost mapping position of the read on the reference
- MAPQ: Mapping quality (Phred-scaled probability that the alignment is incorrect)
- CIGAR: CIGAR string representing the alignment, with operations like match (M), insertion (I), deletion (D), etc.
- RNEXT: Reference name of the mate/next read
- PNEXT: Position of the mate/next read
- TLEN: Observed template length
- SEQ: Read sequence
- QUAL: Read quality scores (Phred-scaled, ASCII-encoded with an offset of 33), or '*' if qualities are not stored
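The FLAG and MAPQ fields are the least self-explanatory of these columns, so a small Python sketch may help. It decodes a FLAG integer into the named bits defined by the SAM specification and converts a MAPQ value back to a probability. The function names and the short labels in `FLAG_BITS` are illustrative choices, not part of any library.

```python
from typing import List

# Bit values follow the SAM specification; the labels are informal.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def mapq_to_error_probability(mapq: int) -> float:
    """Phred-scaled MAPQ to the probability that the mapping is wrong."""
    return 10 ** (-mapq / 10)


print(decode_flag(99))
print(decode_flag(147))
print(mapq_to_error_probability(60))
```

Flag 99 decodes to a paired, properly paired, first-in-pair read whose mate is on the reverse strand, and 147 is the matching flag for that mate.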
Example:
@HD VN:1.6 SO:coordinate
@SQ SN:chr1 LN:248956422
@SQ SN:chr2 LN:242193529
@PG ID:bowtie2 PN:bowtie2 VN:2.3.4.1 CL:"bowtie2-align-s --wrapper basic-0 -x reference -S output.sam -1 read1.fastq -2 read2.fastq"
read1_1234 99 chr1 10000 60 100M = 10200 300 AGCTTAGCTAGCTAGCTAGCTTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAG *
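read1_1234 147 chr1 10200 60 100M = 10000 -300 TAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC *
In this example, the SAM file begins with a header section containing metadata about the file: the format version and sort order (@HD), the reference sequence names and lengths (@SQ), and the software used for alignment (@PG). The alignment records follow, with each line representing an individual read alignment. The two records shown are the mates of one paired-end fragment, so they share a QNAME, carry complementary flags (99 and 147), and point to each other through RNEXT ('=' means the same reference) and PNEXT. The '*' in the QUAL column indicates that base qualities are not stored.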
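As a rough illustration of how an alignment record is read, the following Python sketch splits one record into its 11 mandatory fields and uses the CIGAR string to work out where the alignment ends on the reference. The names `SamRecord`, `parse_sam_line`, `parse_cigar`, and `reference_end` are hypothetical, and the record is a shortened, made-up one. In practice a library such as pysam would normally be used instead.

```python
import re
from typing import List, NamedTuple, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the 11 mandatory tab-separated fields of one alignment record."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []  # CIGAR unavailable
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar}")
    return ops


def reference_end(pos: int, cigar: str) -> int:
    """1-based, inclusive end position of the alignment on the reference."""
    ref_len = sum(n for n, op in parse_cigar(cigar) if op in REF_CONSUMING)
    return pos + ref_len - 1


line = "readA\t99\tchr1\t10000\t60\t8M\t=\t10200\t300\tACGTACGT\t*"
rec = parse_sam_line(line)
print(rec.rname, rec.pos, reference_end(rec.pos, rec.cigar))
```

With a CIGAR of 100M starting at position 10000, the same calculation gives an end position of 10099, because POS is 1-based and the end is inclusive.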
BED (Browser Extensible Data)
The BED format is a simple, tab-delimited text format for representing genomic features and their associated annotations, such as genes, exons, or regulatory elements. BED files consist of one line per feature. Each line contains at least three columns (chromosome, start, and end), optionally followed by further columns such as feature name, score, and strand. Unlike SAM, BED coordinates are 0-based and half-open: the start is 0-based and the end is not included in the feature, so a feature's length is simply end minus start.
Example:
chr1 1000 2000 Feature1 0 +
chr1 3000 4000 Feature2 0 -
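Because coordinate conventions are a frequent source of off-by-one errors, here is a short Python sketch that parses a BED line and converts it to 1-based, inclusive coordinates. The class and function names are hypothetical. Splitting on any whitespace is a simplifying assumption that fails if a feature name contains spaces.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BedFeature:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None

    @property
    def length(self) -> int:
        # Half-open interval, so no "+ 1" is needed.
        return self.end - self.start


def parse_bed_line(line: str) -> BedFeature:
    """Parse the first three to six columns of a BED line."""
    f = line.split()
    if len(f) < 3:
        raise ValueError("BED lines need at least chrom, start, and end")
    return BedFeature(
        chrom=f[0],
        start=int(f[1]),
        end=int(f[2]),
        name=f[3] if len(f) > 3 else None,
        score=int(f[4]) if len(f) > 4 else None,
        strand=f[5] if len(f) > 5 else None,
    )


def to_one_based(feature: BedFeature) -> Tuple[int, int]:
    """Return (start, end) as 1-based, inclusive coordinates."""
    return feature.start + 1, feature.end


feature = parse_bed_line("chr1\t1000\t2000\tFeature1\t0\t+")
print(feature.length, to_one_based(feature))
```

For the first feature in the example, this gives a length of 1000 bases covering 1-based positions 1001 through 2000.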
BEDGraph
The BEDGraph format is a variation of the BED format used for representing continuous data, such as read coverage or signal intensities, in a genome browser. Like BED files, BEDGraph files are tab-delimited text files with 0-based, half-open coordinates, but they contain exactly four columns (chromosome, start, end, and value) representing the genomic region and the associated data value. The format has no strand column: the fourth column is simply a numeric value, which may be negative. Some workflows plot reverse-strand signal as negative values and forward-strand signal as positive values, but that is a convention of the tool that produced the file, not part of the format.
Example:
chr1 1000 2000 1.5
chr1 2000 3000 -2.0
chr1 3000 4000 3.5
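A typical first step with BEDGraph data is to summarize the signal over a region. The Python sketch below parses BEDGraph lines and computes a length-weighted mean of the values. The function names are hypothetical. The sketch assumes that any `track`, `browser`, or comment lines should be skipped and that the intervals do not overlap.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[str, int, int, float]


def parse_bedgraph(lines: Iterable[str]) -> List[Interval]:
    """Parse BEDGraph data lines into (chrom, start, end, value) tuples."""
    intervals: List[Interval] = []
    for raw in lines:
        line = raw.strip()
        # Skip blank lines and non-data lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        intervals.append((chrom, int(start), int(end), float(value)))
    return intervals


def weighted_mean(intervals: List[Interval]) -> float:
    """Mean value per base, weighting each interval by its length."""
    total_bases = sum(end - start for _, start, end, _ in intervals)
    if total_bases == 0:
        raise ValueError("no bases covered")
    total_signal = sum((end - start) * value for _, start, end, value in intervals)
    return total_signal / total_bases


example_lines = [
    "chr1\t1000\t2000\t1.5",
    "chr1\t2000\t3000\t-2.0",
    "chr1\t3000\t4000\t3.5",
]
intervals = parse_bedgraph(example_lines)
print(weighted_mean(intervals))
```

Weighting by interval length matters because BEDGraph intervals can have different sizes; a plain average of the fourth column would over-represent short intervals.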