brownbag-science

Next-Generation Sequencing Workflow

TAKE HOME MESSAGE: A lot of work happens after the sequencer has finished its run

(Extremely) Generic Steps

Sample Preparation
1. Obtain Sample from Patient, extract DNA from Sample
Library Preparation
1. Fragment and Amplify Template DNA
Sequencing
1. Put in sequencer and run the machine
Data Analysis
1. Short Reads from Sequencer (FASTQ)
2. Mapped reads (BAM)
Output: Genetic sequence

Overall Steps - More Detail

Sample preparation (extract DNA from biological sample)
Library Preparation, Construction (chop up DNA into fragments)
Sequencing
- raw sequencing reads captured: BCL
- (on sequencer) BCL converted to FASTQ
  - FASTQ = FASTA + quality info
    - Sequence identifier
    - Nucleotide sequence - aka the read
    - Phred quality info per base
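
The FASTQ layout above can be sketched in a few lines of Python. This is a minimal illustration, not a production parser: it assumes plain four-line records (no wrapped sequence lines) and the common Phred+33 quality encoding, and the names `FastqRecord`, `parse_fastq` and `phred_to_error_probability` are made up for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str        # text after the leading "@"
    sequence: str          # nucleotide sequence, aka the read
    qualities: List[int]   # one Phred score per base


def parse_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Assumes Phred+33 encoding unless another offset is given.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


def phred_to_error_probability(score: int) -> float:
    """Convert a Phred score Q to the base-call error probability 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Made-up single-read example
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for record in parse_fastq(example):
    print(record.identifier, record.sequence, record.qualities)
```

A Phred score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000.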
Data Analysis
- Quality Control, including trimming and/or filtering reads
- Align reads to reference genome: FASTQ -> SAM/BAM
- Alignment Cleanup: BAM
- Variant Calling (VCF)
- Variant Annotation
  - Add information for each variant: symbol, name, transcript, amino acid sequence
  - COSMIC, dbSNP/1000 genomes, panel of normals, etc
- Variant Filtering - Biological
  - In a clinical setting there is usually no matched normal sample -> remove unimportant variants
  - Remove known germline variants in population; databases keep improving (e.g. dbSNP -> 1000 Genomes -> SweGen)
- Variant Filtering - Technical
  - VAF (variant allele frequency) cutoff, read depth, variant quality score
  - Panel of normals
- General Quality Control
  - Base quality, GC bias, distinct reads, percent reads mapped, mapping quality, average depth on target region, uniformity, etc
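
As a rough illustration of the technical variant filtering step above, the sketch below applies an allele-frequency cutoff, a minimum read depth, a minimum variant quality score and a panel-of-normals lookup. Everything here is an assumption made for the example: the `Variant` class, the function name, the threshold values and the variants themselves are made up, and real pipelines read these fields from a VCF and tune thresholds per assay.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)


@dataclass
class Variant:
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int       # total reads covering the position
    alt_reads: int   # reads supporting the alternate allele
    qual: float      # variant quality score from the caller

    @property
    def vaf(self) -> float:
        """Variant allele frequency: fraction of reads carrying the alt allele."""
        return self.alt_reads / self.depth if self.depth else 0.0

    @property
    def key(self) -> VariantKey:
        return (self.chrom, self.pos, self.ref, self.alt)


def passes_technical_filters(
    variant: Variant,
    min_vaf: float = 0.05,      # illustrative threshold, not a recommendation
    min_depth: int = 100,       # illustrative threshold
    min_qual: float = 30.0,     # illustrative threshold
    panel_of_normals: FrozenSet[VariantKey] = frozenset(),
) -> bool:
    """Return True if the variant survives all technical filters."""
    return (
        variant.depth >= min_depth
        and variant.vaf >= min_vaf
        and variant.qual >= min_qual
        and variant.key not in panel_of_normals
    )


# Made-up variants; the panel of normals holds recurrent artifacts seen in normal samples
pon = frozenset({("chr2", 2000, "C", "T")})
calls = [
    Variant("chr1", 1000, "A", "G", depth=500, alt_reads=100, qual=60.0),  # kept
    Variant("chr1", 1500, "G", "A", depth=1000, alt_reads=10, qual=60.0),  # VAF too low
    Variant("chr2", 2000, "C", "T", depth=400, alt_reads=80, qual=55.0),   # in panel of normals
]
kept = [v for v in calls if passes_technical_filters(v, panel_of_normals=pon)]
print([v.key for v in kept])
```

Keeping each threshold as a named parameter makes it easy to report which cutoffs were used for a given run.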

Data Analysis

Steps in a typical next-generation resequencing workflow. De facto standard file formats are given in parentheses.

(figure taken from nature.com)