Next-Generation Sequencing Workflow
TAKE HOME MESSAGE: There is a lot of work that happens after the Sequencer machine does its work
(Extremely) Generic Steps
- Sample Preparation
- Obtain Sample from Patient, extract DNA from Sample
- Library Preparation
- Fragment and Amplify Template DNA
- Sequencing
- Put in sequencer and run the machine
- Data Analysis
- Short Reads from Sequencer (FASTQ)
- Mapped reads (BAM)
- Output: Genetic sequence
Overall Steps - More Detail
- Sample preparation (extract DNA from biological sample)
- Library Preparation, Construction (chop up DNA into fragments)
- Sequencing
- raw sequencing reads captured: BCL
- (on sequencer) BCL converted and then converted FASTQ
- FASTQ = FASTA + quality info
- Sequence identifier
- Nucleotide sequence - aka the read
- Phred quality info per base
- Data Analysis
- Quality Control, including trimming and/or filtering reads ()
- Align reads to reference genome: FASTQ –> SAM/BAM
- Alignment Cleanup: BAM
- Variant Calling (VCF)
- Variant Annotation
- Add information for each variant: symbol, name, transcript, amino acid sequence
- COSMIC, dbSNP/1000 genomes, panel of normals, etc
- Variant Filtering - Biological
- In a clinical setting, usually no matched normal –> Remove unimportant variants
- Remove known germline variants in population; Improving databases (e.g. dbSNP -> 1000 genomes -> 1000 genomes -> SweGen)
- Variant Filtering - Technical
- VAR cutoff, read depth, variant quality score
- Panel of normals
- General Quality Control
- Base quality, GC bias, distinct reads, percent reads mapped, mapping quality, average depth on target region, uniformity, etc
Data Analysis
Steps in a typical next-generation resequencing workflow. De facto standard file formats are given in parentheses.
(figure taken from nature.com)
index