Analysis within the Cancer Genomics Cloud (CGC)
The Seven Bridges Cancer Genomics Cloud (CGC) is powered by Velsera and funded by the National Cancer Institute. The CGC is a cloud-based platform for storing, analyzing, and computing on large cancer datasets. The CGC interoperates with the ICDC, making it easier to move from data acquisition to data analysis.
Step 1: Inspect summary stats for each file with Sambamba Flagstat
Info
Before conducting differential expression analysis, it is important to check the quality of the BAM files that will be used as inputs. The percentage of mapped reads can have a large impact on downstream analysis, and a low percentage could signal an issue with the aligner that was used to generate the BAM file.
- Click on Sambamba Flagstat
- Click on the button from the top right-hand side
- Click on the toggle to turn Batching On
- Select File from the Batch by dropdown menu
- Click on Input alignments
- Select all BAM files by clicking on the respective checkboxes
- Click on the button on the top right-hand side
- Click on the button on the top right-hand side
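Once the Flagstat tasks have finished, the mapped-read percentage can be checked for each report. The following is a minimal sketch, assuming the report follows the usual flagstat layout in which each line reads `<QC-passed> + <QC-failed> <description>`. The function name `mapped_percentage` and the example report are hypothetical and are not taken from the tutorial data.

```python
import re


def mapped_percentage(flagstat_text: str) -> float:
    """Return the percentage of QC-passed reads that are mapped."""
    total = mapped = None
    for line in flagstat_text.splitlines():
        # Each flagstat line starts with "<passed> + <failed> <description>"
        match = re.match(r"(\d+) \+ (\d+) (.+)", line.strip())
        if not match:
            continue
        passed, description = int(match.group(1)), match.group(3)
        if description.startswith("in total"):
            total = passed
        elif description.startswith("mapped"):
            mapped = passed
    if not total or mapped is None:
        raise ValueError("could not find total and mapped read counts")
    return 100.0 * mapped / total


# Hypothetical report, for illustration only
example_report = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
950 + 0 mapped (95.00%:N/A)
"""

print(f"{mapped_percentage(example_report):.2f}% mapped")
```

A sample whose percentage is far below that of the others is worth investigating before it is included in the count step.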
Step 2: Inspect BAM file headers with Samtools View
Info
Before conducting differential expression analysis it is important to determine which reference genome was used to generate the BAM files and how the BAM file is sorted.
- Click on Apps from the menu bar
- Click on Samtools View
- Click on the button from the top right-hand side
- Under App Settings scroll down to Output the header only and select True from the dropdown menu
- Click on the toggle to turn Batching On
- Select File from the Batch by dropdown menu
- Click on Select file(s) dropdown menu associated with the Input BAM/SAM/CRAM file
- Select all BAM files by clicking on the respective checkboxes
- Click on the button on the top right-hand side
- Click on the button on the top right-hand side
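The header-only output from the steps above is plain text, so the sort order and reference sequences can be read out programmatically. The sketch below assumes the standard SAM header layout: tab-separated `TAG:VALUE` fields, with the sort order in the `SO` tag of the `@HD` line and each reference sequence on its own `@SQ` line. The function name `summarize_header` and the example header values are hypothetical.

```python
def summarize_header(header_text: str) -> dict:
    """Extract the sort order and reference sequences from a SAM header."""
    summary = {"sort_order": "unknown", "references": []}
    for line in header_text.splitlines():
        fields = line.split("\t")
        # Fields after the record type are TAG:VALUE pairs
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@HD":
            summary["sort_order"] = tags.get("SO", "unknown")
        elif fields[0] == "@SQ":
            summary["references"].append((tags.get("SN"), int(tags.get("LN", 0))))
    return summary


# Hypothetical header with made-up sequence lengths, for illustration only
example_header = "\n".join([
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "@SQ\tSN:chr2\tLN:2000",
    "@PG\tID:aligner\tPN:aligner",
])

print(summarize_header(example_header))
```

In the SAM format, `SO` takes the values `unknown`, `unsorted`, `queryname`, or `coordinate`. A `queryname` value corresponds to a name-sorted file.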
Step 3: Load the appropriate reference annotation file
Info
To count reads that overlap the exons of each gene, the HTSeq-count tool needs a Gene Transfer Format (GTF) file that describes each gene in a particular reference genome. In this case, the original BAM files were mapped to the canFam3 reference genome, so the matching NCBI RefSeq annotation for canFam3 is used.
- Download the GTF file located here https://hgdownload.soe.ucsc.edu/goldenPath/canFam3/bigZips/genes/
- Click on canFam3.ncbiRefSeq.gtf.gz
- Locate this file on your local computer and open it to unzip it
- Navigate back to the CGC and add this file to the project
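If opening the archive does not unzip it on your operating system, it can be decompressed with a short script instead. This is a minimal sketch using only the Python standard library. The helper name `gunzip` is hypothetical, and the script assumes the downloaded archive sits in the current working directory.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def gunzip(source: Path, destination: Optional[Path] = None) -> Path:
    """Decompress a .gz file and return the path of the decompressed copy."""
    source = Path(source)
    if destination is None:
        # Dropping the final suffix turns "x.gtf.gz" into "x.gtf"
        destination = source.with_suffix("")
    with gzip.open(source, "rb") as compressed, open(destination, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return destination


if __name__ == "__main__":
    archive = Path("canFam3.ncbiRefSeq.gtf.gz")
    # Only decompress when the archive has actually been downloaded
    if archive.exists():
        print(gunzip(archive))
```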
Step 4: Count sequencing reads with htseq-count
Info
Binary Alignment Map (BAM) files are compressed binary representations of sequence data aligned to a particular reference genome. Although these files are not human readable, tools such as HTSeq can count the sequencing reads that overlap the exons of each gene in a reference genome.
- Click on Apps from the secondary menu bar
- Click HTSeq-count
- Click on the button from the top right-hand side
- Under App Settings select name from the Order dropdown menu
- Under App Settings select ignore from the secondary alignments dropdown menu
- Under App Settings select ignore from the supplementary alignments dropdown menu
- Click on the toggle to turn Batching On
- Select File from the Batch by dropdown menu
- Click on the button associated with Aligned reads
- Select all name-sorted BAM files from the name_sorted folder
- Click on the button associated with Reference annotation file
- Select canFam3.ncbiRefSeq.gtf
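For readers who want to reproduce this step outside the CGC, the app settings chosen above map onto options of the `htseq-count` command-line tool. The sketch below only builds and prints the command and does not run it. The helper name `htseq_count_command` and the example BAM path are hypothetical, and the CGC app may call the tool with additional options that are not shown here.

```python
import shlex


def htseq_count_command(bam_path: str, gtf_path: str) -> list:
    """Build an htseq-count command that mirrors the app settings above."""
    return [
        "htseq-count",
        "--format", "bam",                        # input alignments are BAM files
        "--order", "name",                        # Order: name
        "--secondary-alignments", "ignore",       # secondary alignments: ignore
        "--supplementary-alignments", "ignore",   # supplementary alignments: ignore
        bam_path,
        gtf_path,
    ]


# Hypothetical sample path, for illustration only
command = htseq_count_command("name_sorted/sample_1.bam", "canFam3.ncbiRefSeq.gtf")
print(shlex.join(command))
```

With batching by file, the platform runs one such task per BAM file, which is why each sample ends up with its own counts file.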
Step 5: Create a CSV file with phenotype data for all samples for DESeq2
Info
Before conducting differential expression analysis, a file must be created to tell DESeq2 how the samples relate to one another. For this tutorial, this file can be generated from the file manifest exported from the ICDC.
- Click on Files from the menu bar
- In the search box type .csv and hit enter
- Click on the file manifest which will be named with a series of letters and numbers with a .csv file extension
- Click on the button to initiate a download to your local machine
- Open this file in Excel or any similar application
- Move the sample_id column so that it is the first column in the file
- Delete all rows pertaining to the Index Files with a file_format of bai
- Save the file as phenotype_filtered.csv
- The file created should be formatted similarly to the file shown below
- Click on the button to expand the dropdown menu and select Your Computer
- Click on the button to add the file created named phenotype_filtered.csv
- Click on the green button
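The spreadsheet edits in this step (moving sample_id to the first column and dropping the bai index rows) can also be scripted. The following is a minimal sketch using the Python standard library. It assumes the manifest has columns named `sample_id` and `file_format`, as the steps above imply. The function name `filter_manifest`, the `file_name` column, and the example rows are hypothetical.

```python
import csv
import io


def filter_manifest(manifest_text: str) -> str:
    """Move sample_id to the first column and drop rows for bai index files."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    columns = ["sample_id"] + [c for c in reader.fieldnames if c != "sample_id"]
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Index files carry no phenotype information of their own
        if row["file_format"].strip().lower() == "bai":
            continue
        writer.writerow(row)
    return output.getvalue()


# Hypothetical manifest, for illustration only
example_manifest = (
    "file_name,file_format,sample_id,diagnosis\n"
    "s1.bam,bam,S1,Bladder Cancer\n"
    "s1.bam.bai,bai,S1,Bladder Cancer\n"
    "s2.bam,bam,S2,Healthy Control\n"
)

print(filter_manifest(example_manifest))
```

To work with the downloaded file, read its contents into a string, pass it to the function, and write the result to phenotype_filtered.csv.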
Step 6: Conduct differential expression with DESeq2
Info
The DESeq2 package determines differential expression between sample groups from the raw count table derived by HTSeq. It fits a negative binomial generalized linear model for each gene and then uses the Wald test for significance testing. Count outliers are detected using Cook’s distance and can be removed from further analysis. The Wald test p-values from the subset of genes that pass independent filtering are then adjusted for multiple testing using the Benjamini-Hochberg procedure.
- Click on Apps from the menu bar
- Click DESeq2
- Click on the button from the top right-hand side
- Under App Settings enter Urinary_Bladder_Cancer_DGE as the Analysis title
- Under App Settings enter diagnosis as the Covariate of interest
- Under App Settings enter 0.01 as the FDR cutoff
- Under App Settings enter Healthy Control as the Factor level - reference
- Under App Settings enter Bladder Cancer as the Factor level - test
- Under App Settings select htseq from the dropdown menu of Quantification tool
- Under App Settings select True from the dropdown menu of log2 fold change shrinkage
- Under Inputs click on the Select file(s) button associated with Expression data
- Select the counts.tsv files from the Counts folder
- Under Inputs click on the Select file(s) button associated with Gene annotation
- Select canFam3.ncbiRefSeq.gtf
- Under Inputs click on the Select file(s) button associated with Phenotype data
- Select phenotype_filtered.csv
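To make the multiple-testing adjustment concrete, the Benjamini-Hochberg procedure can be sketched in a few lines. This is an illustrative pure-Python version, not the code DESeq2 itself runs. The function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical p-values, for illustration only
example_p_values = [0.01, 0.04, 0.03, 0.005]
adjusted_p_values = benjamini_hochberg(example_p_values)

fdr_cutoff = 0.01  # the FDR cutoff entered in the app settings above
significant = [p <= fdr_cutoff for p in adjusted_p_values]
print(adjusted_p_values, significant)
```

A gene is reported as differentially expressed only when its adjusted p-value is at or below the chosen FDR cutoff, so a stricter cutoff yields a shorter gene list.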