brownbag-science

File formats

Summary

| Format | Who generates it? | Who reads it? |
| --- | --- | --- |
| FASTQ | sequencers, simulation tools | mapping tools, QC tools, cleaning tools, taxonomic assignment tools |
| FASTA | assembly tools, gene prediction tools | visualization tools, almost all |
| SAM/BAM/BAI | mapping tools, samtools | visualization tools, variant discovery tools, counting tools |
| BED | annotation tools, bedtools | visualization tools, variant discovery tools, peak calling tools, counting tools |
| GFF | annotation tools | visualization tools, variant discovery tools, peak calling tools, RNAseq tools |
| VCF | variant discovery tools | vcftools, visualization tools, variant discovery tools |

Pipeline

Data obtained from next-generation sequencing must be processed several times. Most processing steps aim to extract only the information needed for a specific downstream analysis, and redundant entries are often discarded. Specific data formats are therefore often associated with particular steps of a data processing pipeline.

Here we give only brief key descriptions of each file format; for more detailed information we link to external websites. Be aware that the formats are sorted alphabetically here, not by their order of use within an analysis pipeline, which is depicted here:

../_images/flowChart_FileFormats.png

Follow the links for more information on the different tool collections mentioned in the figure:

samtools, UCSCtools, BEDtools

Source

NGS file formats

BAM

BED

bedGraph

chr1 10 20 1.5
chr1 20 30 1.7
chr1 30 40 2.0
chr1 40 50 1.8
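
Each bedGraph line above holds four columns: chromosome, start, end, and a numeric value. The following is a minimal Python sketch of reading such lines, assuming whitespace-separated columns and the zero-based, half-open coordinates used by the UCSC BED family. The names `BedGraphInterval` and `parse_bedgraph` are made up for this illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class BedGraphInterval(NamedTuple):
    chrom: str
    start: int    # zero-based, inclusive (assumed UCSC convention)
    end: int      # exclusive
    value: float


def parse_bedgraph(lines: Iterable[str]) -> Iterator[BedGraphInterval]:
    """Yield one interval per data line; skip blank, track, browser and comment lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield BedGraphInterval(chrom, int(start), int(end), float(value))


example = """\
chr1 10 20 1.5
chr1 20 30 1.7
chr1 30 40 2.0
chr1 40 50 1.8
"""

intervals = list(parse_bedgraph(example.splitlines()))

# Length-weighted mean signal over all covered bases
covered = sum(iv.end - iv.start for iv in intervals)
mean_signal = sum((iv.end - iv.start) * iv.value for iv in intervals) / covered
print(len(intervals), covered, round(mean_signal, 2))
```

Weighting by interval length matters as soon as intervals differ in width; here all four are 10 bases wide, so the weighted and unweighted means coincide.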

bigWig

FASTA

Common format for holding sequence data. Wikipedia: FASTA

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCTGGGTTCGATCC
CCAGCACCTCCA
>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)
GGGGGATTAGCTCAAATGGTAGAGCGCTCGCTTAGCATGCAAGAGGtAGTGGGATCGATG
CCCACATCCTCCA
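
Each record in the listing above starts with a `>` header line, followed by one or more sequence lines. The following is a minimal sketch of a reader in Python using only the standard library. `read_fasta` is a name chosen for this illustration; real projects may prefer an established parser such as Biopython's `SeqIO`.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its concatenated sequence."""
    records: Dict[str, List[str]] = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them and join later
            records[header].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


example = """\
>Mus_musculus_tRNA-Ala-AGC-1-1 (chr13.trna34-AlaAGC)
GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCTGGGTTCGATCC
CCAGCACCTCCA
>Mus_musculus_tRNA-Ala-AGC-10-1 (chr13.trna457-AlaAGC)
GGGGGATTAGCTCAAATGGTAGAGCGCTCGCTTAGCATGCAAGAGGtAGTGGGATCGATG
CCCACATCCTCCA
"""

sequences = read_fasta(example.splitlines())
for name, seq in sequences.items():
    # The first whitespace-separated token of the header is commonly used as the ID
    print(name.split()[0], len(seq))
```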

FASTQ

The FASTQ format is the de facto standard output of sequencing instruments. It records both the sequence and its corresponding quality scores, which makes it essentially FASTA with quality scores. Wikipedia: FASTQ
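
A FASTQ record usually spans four lines: an `@` identifier line, the sequence, a `+` separator line, and a quality string with one character per base. The sketch below reads such records and decodes the quality characters. It assumes unwrapped four-line records and the Phred+33 offset; older Illumina pipelines used other offsets, so check your data first. The record shown and the name `read_fastq` are invented for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, quality scores) for each four-line record."""
    it = iter(lines)
    while True:
        record = [line.rstrip("\n") for line in islice(it, 4)]
        if not record:
            return  # input exhausted
        if (len(record) < 4
                or not record[0].startswith("@")
                or not record[2].startswith("+")):
            raise ValueError(f"malformed FASTQ record: {record!r}")
        name, seq, _, qual = record
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {name}")
        yield name[1:], seq, [ord(c) - PHRED_OFFSET for c in qual]


# Made-up record for illustration only
example = ["@read1", "GATTACA", "+", "IIIIH#5"]

for name, seq, quals in read_fastq(example):
    print(name, seq, quals)
```

Under Phred+33, the character `I` decodes to a score of 40 and `#` to a score of 2, so the sixth base of this made-up read would be considered unreliable.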

SAM

SAM (Sequence Alignment/Map) and its binary counterpart BAM typically represent the results of aligning the reads of a FASTQ file to a reference FASTA file; they describe the individual pairwise alignments that were found. Different algorithms may create different alignments (and hence different BAM files).

../_images/glossary_sam.png

SAM header section

SAM alignment section

SAM Tools

Warning

Although the SAM/BAM format is rather meticulously defined and documented, whether an alignment program produces a SAM/BAM file that adheres to the specification is entirely up to its programmer. The mapping quality (MAPQ), the CIGAR string, and particularly the optional fields (columns beyond the 11th) are often defined very differently from program to program. If you plan to filter your data on any of these criteria, make sure you know exactly how these entries were calculated and set!
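
Because these fields vary between aligners, it helps to inspect them explicitly before filtering. The following minimal sketch splits one alignment line into the 11 mandatory SAM fields plus the optional `TAG:TYPE:VALUE` fields, and expands the CIGAR string. The alignment line is invented for illustration, and `parse_sam_line` and `parse_cigar` are hypothetical helper names, not part of samtools or any library.

```python
import re
from typing import Dict, List, Tuple

MANDATORY = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
             "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Expand e.g. '5M1I4M' into [(5, 'M'), (1, 'I'), (4, 'M')]; '*' means unavailable."""
    if cigar == "*":
        return []
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def parse_sam_line(line: str) -> Dict[str, object]:
    """Split one alignment line (not an '@' header line) into named fields."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY):
        raise ValueError("a SAM alignment line needs at least 11 tab-separated fields")
    record: Dict[str, object] = {}
    for name, value in zip(MANDATORY, columns):
        record[name] = int(value) if name in INT_FIELDS else value
    # Optional fields have the form TAG:TYPE:VALUE; their meaning is aligner-specific
    tags: Dict[str, object] = {}
    for field in columns[len(MANDATORY):]:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    record["tags"] = tags
    return record


# Invented alignment line for illustration only
line = "read1\t0\tchr1\t100\t60\t5M1I4M\t*\t0\t0\tGATTACAGAT\tIIIIIIIIII\tNM:i:1"
rec = parse_sam_line(line)
print(rec["MAPQ"], parse_cigar(rec["CIGAR"]), rec["tags"])
```

A filter such as `rec["MAPQ"] >= 30` is only meaningful once you know how the aligner in question assigns MAPQ, which is exactly the caveat raised above.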

Also See

2bit

FastQC

FastQC (Andrews, 2010) - quality control for (Illumina) FASTQ files

BAI

Index file that accompanies a BAM file

Pileup

GFF

GTF