The number of base pairs is frequently used as a synonym for the number of nucleotides in a single-strand sequence
● This sequence has 5 nucleotides: ACGGT
● We can also say that it has 5 base pairs
● Metric prefixes (kilo, mega, giga, etc.) are used for sequence lengths
● kb → kilobases (1,000 bases)
● Mb → megabases (1,000,000 bases)
● Gb → gigabases (1,000,000,000 bases)
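The length units above can be sketched in a few lines of Python. This is only an illustration: the function name `format_length` and the two-decimal rounding are assumptions made for this example, not a standard convention.

```python
def format_length(n_bases: int) -> str:
    """Format a sequence length in bases using bp, kb, Mb, or Gb.

    Hypothetical helper for illustration; rounding to two decimals
    is an arbitrary choice.
    """
    units = [(1_000_000_000, "Gb"), (1_000_000, "Mb"), (1_000, "kb")]
    for factor, unit in units:
        if n_bases >= factor:
            return f"{n_bases / factor:.2f} {unit}"
    return f"{n_bases} bp"


sequence = "ACGGT"
print(len(sequence))                  # 5 nucleotides, i.e. 5 bp
print(format_length(len(sequence)))   # 5 bp
print(format_length(1_500))           # 1.50 kb
print(format_length(3_100_000_000))   # 3.10 Gb
```

The length of a single-strand sequence is simply the number of characters in its string representation, which is why "5 nucleotides" and "5 bp" are used interchangeably here.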
genome: The full genetic information of an organism
● Contains all chromosomes
● Comprises the coding and non-coding sequence data of the organism
● Coding sequence data → part of the genome that encodes proteins
● Non-coding DNA (in earlier days: “junk” DNA) → part of the genome that does not encode proteins but can still have a function
– The function of non-coding DNA is only partially known
– Some non-coding DNA regulates gene expression, i.e. when and how much protein is produced
Single-Strand DNA:
Coding versus non-coding DNA
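exons: the parts of a gene that are kept after splicing; in protein-coding genes they carry the coding sequence (the part that gets turned into protein)
introns: the parts of a gene that lie between exons and are spliced out before translation, so they are non-coding (historically lumped in with “junk” DNA)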
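The relationship between exons and introns can be sketched with a toy example. The gene string, the exon coordinates, and the function name `splice` below are all made up for illustration; real annotations come from files such as GFF/GTF and real splicing acts on RNA, not on a Python string.

```python
def splice(gene, exons):
    """Join exon segments and drop everything in between (the introns).

    `exons` is a list of (start, end) pairs, 0-based and end-exclusive.
    Hypothetical helper for illustration only.
    """
    return "".join(gene[start:end] for start, end in exons)


# Toy gene: uppercase = exon, lowercase = intron (purely a visual aid).
gene = "ATGGCC" + "gtaagtttcag" + "TTTTAA"
exons = [(0, 6), (17, 23)]

coding = splice(gene, exons)
print(coding)                    # ATGGCCTTTTAA
print(len(gene), len(coding))    # 23 12
```

Only the exon segments end up in the spliced sequence; the 11-base intron in the middle is removed.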
depth of coverage: a measure of the number of times that a specific genomic site is sequenced during a sequencing run. Higher coverage generally means more confidence in the base called at that site, at the price of more sequencing.
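As a rough sketch of the idea, depth can be counted per site from the positions where reads land, and the average over the genome can be estimated as (number of reads × read length) / genome size. The function names and the toy numbers below are assumptions for illustration; the per-site count assumes every read has the same length and aligns without gaps.

```python
def depth_per_site(read_starts, read_length, genome_size):
    """Count how many reads cover each position (0-based starts).

    Simplified: all reads have the same length and align without gaps.
    """
    depth = [0] * genome_size
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_size)):
            depth[pos] += 1
    return depth


def mean_coverage(n_reads, read_length, genome_size):
    """Average depth estimate: total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


# Toy example: a 10 bp "genome" and four 4 bp reads.
starts = [0, 2, 3, 6]
print(depth_per_site(starts, 4, 10))    # [1, 1, 2, 3, 2, 2, 2, 1, 1, 1]
print(mean_coverage(len(starts), 4, 10))  # 1.6
```

Position 3 is covered by three reads, so its depth is 3, while the average over all ten positions is 1.6.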
nucleotide:
base: a nucleotide
bp: base pair(s); used in these notes as the same thing as a base
sequence lengths:
read length: the number of base pairs that a sequencer “reads”. A read length could be anywhere from 50 bp to several thousand bp.
read: a single sequence produced from a sequencer. Think: a sequencing machine read a molecule and this is what it thinks it is.
library: a collection of DNA fragments that have been prepared for sequencing. The term generally refers to the prepared DNA of an individual sample.
run: an entire sequencing reaction from start to finish.
NGS: next-generation sequencing. High-throughput (DNA) sequencing technologies developed after roughly 2000.
hg19: The UCSC assembly of the human genome, version 19; equivalent to GRCh37.
hg38: The UCSC assembly of the human genome, version 38; equivalent to GRCh38.
GRCh37: Genome Reference Consortium Human Build 37; matches with UCSC assembly hg19. Released in February 2009.
GRCh38: Genome Reference Consortium Human Build 38; matches with UCSC assembly hg38. Released in December 2013.
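The naming equivalences above can be captured in a small lookup table, which is handy when checking that two files refer to the same assembly. This is a minimal sketch: the names `UCSC_TO_GRC` and `same_assembly` are invented for this example, and the comparison treats assembly names purely as labels (it does not check chromosome naming or sequence content).

```python
# UCSC name -> Genome Reference Consortium name, as listed in the glossary.
UCSC_TO_GRC = {"hg19": "GRCh37", "hg38": "GRCh38"}


def same_assembly(name_a, name_b):
    """Return True if two assembly names refer to the same build.

    Hypothetical helper: names are normalised to the GRC form first;
    unknown names are compared as-is.
    """
    def canonical(name):
        return UCSC_TO_GRC.get(name, name)

    return canonical(name_a) == canonical(name_b)


print(same_assembly("hg19", "GRCh37"))   # True
print(same_assembly("hg19", "GRCh38"))   # False
print(same_assembly("hg38", "hg38"))     # True
```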