Analysis within the Cancer Genomics Cloud (CGC)

The Seven Bridges Cancer Genomics Cloud (CGC) is powered by Velsera and funded by the National Cancer Institute. The CGC is a cloud-based platform for storing, analyzing, and computing on large cancer datasets. The CGC interoperates with the ICDC, which makes it easier to move from data acquisition to data analysis.

Step 1: Inspect summary stats for each file with Sambamba Flagstat

Info

Before conducting differential expression analysis, it is important to check the quality of the BAM files that will be used as inputs. The percentage of mapped reads can have a large impact on downstream analysis, and a low value could signal an issue with the aligner that was used to generate the BAM file.

Click on Sambamba Flagstat
Click on the Run button from the top right-hand side
Click on the toggle to turn Batching On
Select File from the Batch by dropdown menu
Click on Input alignments
Select all BAM files by clicking on the respective checkboxes
Click on the Save Selection button on the top right-hand side
Click on the Run button on the top right-hand side
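Once the tasks finish, the mapped-read percentage can be checked programmatically. The following is a minimal sketch, assuming the Sambamba Flagstat report follows the samtools flagstat text layout ("N + M in total ...", "N + M mapped (...)"). The function name and the example report are hypothetical and for illustration only.

```python
import re


def mapped_percentage(flagstat_text):
    """Return the percentage of QC-passed reads that are mapped.

    Assumes a samtools-flagstat-style report; only the QC-passed
    count (the first number on each line) is used.
    """
    total = mapped = None
    for line in flagstat_text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) (.*)", line.strip())
        if not match:
            continue
        passed, label = int(match.group(1)), match.group(3)
        if label.startswith("in total"):
            total = passed
        elif label.startswith("mapped ("):
            mapped = passed
    if not total:
        raise ValueError("no reads reported, or 'in total' line not found")
    if mapped is None:
        raise ValueError("'mapped' line not found")
    return 100.0 * mapped / total


# Made-up report, for illustration only (not real tutorial data).
example = """\
1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 duplicates
950 + 0 mapped (95.00%:N/A)
"""
print(mapped_percentage(example))  # prints 95.0
```

The threshold for an acceptable mapping rate depends on the experiment, so this sketch only reports the value and does not apply a cutoff.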

Step 2: Inspect BAM file headers with Samtools View

Info

Before conducting differential expression analysis, it is important to determine which reference genome was used to generate the BAM files and how each BAM file is sorted.

Click on Apps from the menu bar
Click on Samtools View
Click on the Run button from the top right-hand side
Under App Settings scroll down to Output the header only and select True from the dropdown menu
Click on the toggle to turn Batching On
Select File from the Batch by dropdown menu
Click on the Select file(s) dropdown menu associated with Input BAM/SAM/CRAM file
Select all BAM files by clicking on the respective checkboxes
Click on the Save selection button on the top right-hand side
Click on the Run button on the top right-hand side
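The header-only output is plain SAM header text, so the sort order and reference sequences can be read from it directly. The sketch below relies on the SAM format convention that the @HD line carries an SO (sort order) tag and each @SQ line carries SN (sequence name) and LN (length) tags. The function name and the example header values are hypothetical.

```python
def summarize_header(header_text):
    """Return (sort_order, [(sequence_name, length), ...]) from SAM header text."""
    sort_order = None
    references = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        # Header fields after the record type look like TAG:VALUE.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if fields[0] == "@HD":
            sort_order = tags.get("SO")  # e.g. coordinate or queryname
        elif fields[0] == "@SQ":
            references.append((tags.get("SN"), int(tags.get("LN", 0))))
    return sort_order, references


# Illustrative header only; the names and lengths are placeholders.
example_header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "@SQ\tSN:chr2\tLN:2000\n"
    "@PG\tID:aligner\tPN:aligner\n"
)
order, refs = summarize_header(example_header)
print(order, refs)
```

A sort order of queryname indicates a name-sorted file, while coordinate indicates sorting by genomic position. The sequence names and lengths help identify which reference assembly was used.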

Step 3: Load the appropriate reference annotation file

Info

To count reads across the exons of each gene, the HTSeq-count tool needs a Gene Transfer Format (GTF) file that contains information about each gene in a particular reference genome. In this case, the original BAM files were mapped to the canFam3 reference genome, so the NCBI RefSeq annotation for canFam3 is used.

Open the GTF download page at https://hgdownload.soe.ucsc.edu/goldenPath/canFam3/bigZips/genes/
Click on canFam3.ncbiRefSeq.gtf.gz to download it
Locate this file on your local computer and open it to unzip it
Navigate back to the CGC and add this file to the project
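If opening the archive does not unzip it on your system, the file can also be decompressed with a short script. This is a minimal sketch using Python's standard gzip module. The function name is hypothetical, and the paths in the usage comment assume the downloaded file is in the current working directory.

```python
import gzip
import shutil


def gunzip(src_path, dst_path):
    """Decompress a .gz file to dst_path and return dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        # Stream in chunks so a large annotation file is not held in memory.
        shutil.copyfileobj(src, dst)
    return dst_path


# Example usage (assumes the download sits in the current directory):
# gunzip("canFam3.ncbiRefSeq.gtf.gz", "canFam3.ncbiRefSeq.gtf")
```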

Step 4: Count sequencing reads with htseq-count

Info

Binary Alignment Map (BAM) files are compressed binary representations of sequence data mapped to a particular reference genome. Although these files are not human-readable, tools such as HTSeq can count the sequencing reads that overlap the exons of each gene in a reference genome.

Click on Apps from the secondary menu bar
Click on HTSeq-count
Click on the Run button from the top right-hand side
Under App Settings select name from the Order dropdown menu
Under App Settings select ignore from the secondary alignments dropdown menu
Under App Settings select ignore from the supplementary alignments dropdown menu
Click on the toggle to turn Batching On
Select File from the Batch by dropdown menu
Click on the Select file(s) button associated with Aligned reads
Select all name-sorted BAM files from the name_sorted folder
Click on the Select file(s) button associated with Reference annotation file
Select canFam3.ncbiRefSeq.gtf (the unzipped file added to the project in Step 3)
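For readers who want to reproduce this step outside the CGC, the app settings above correspond roughly to options of the htseq-count command-line tool. The sketch below only builds the argument list and does not run the tool. It assumes the CGC app passes these settings through to htseq-count unchanged, and the BAM file name is a hypothetical placeholder.

```python
import shlex


def build_htseq_command(bam_path, gtf_path):
    """Build an htseq-count argument list mirroring the app settings above."""
    return [
        "htseq-count",
        "--format", "bam",                          # input is BAM
        "--order", "name",                          # reads are name-sorted
        "--secondary-alignments", "ignore",
        "--supplementary-alignments", "ignore",
        bam_path,
        gtf_path,
    ]


# Hypothetical input name; the command is printed, not executed.
cmd = build_htseq_command("sample_name_sorted.bam", "canFam3.ncbiRefSeq.gtf")
print(shlex.join(cmd))
```

With HTSeq installed locally, the list could be passed to subprocess.run with standard output redirected to a counts file, one run per BAM file, which mirrors batching by File.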

Step 5: Create a CSV file with phenotype data for all samples for DESeq2

Info

Before conducting differential expression analysis, a phenotype file must be created to tell DESeq2 how the samples relate to one another. For this tutorial, this file can be generated from the file manifest exported from the ICDC.

Click on Files from the menu bar
In the search box, type .csv and press Enter
Click on the file manifest which will be named with a series of letters and numbers with a .csv file extension
Click on the Download button to initiate a download to your local machine
Open this file in Excel or any similar application
Move the sample_id column so that it is the first column in the file
Delete all rows for index files, that is, rows with a file_format of bai
Save the file as phenotype_filtered.csv
The resulting file should be formatted similarly to the file shown below
Click on the Add files button to expand the dropdown menu and select Your Computer
Click on the Browse files button to add the file created named phenotype_filtered.csv
Click on the green Start upload button
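The spreadsheet edits above can also be scripted. The following is a minimal sketch using Python's csv module. It relies on the sample_id and file_format columns named in the steps above, while the function name, the other column names, and the example rows are hypothetical.

```python
import csv
import io


def build_phenotype_csv(manifest_text):
    """Move sample_id to the first column and drop index-file (bai) rows."""
    reader = csv.DictReader(io.StringIO(manifest_text))
    if "sample_id" not in reader.fieldnames:
        raise KeyError("manifest has no sample_id column")
    columns = ["sample_id"] + [c for c in reader.fieldnames if c != "sample_id"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row["file_format"].strip().lower() == "bai":
            continue  # skip BAM index files
        writer.writerow(row)
    return out.getvalue()


# Made-up manifest for illustration; a real ICDC manifest has more columns.
manifest = (
    "file_name,file_format,sample_id,diagnosis\n"
    "S1.bam,bam,S1,Bladder Cancer\n"
    "S1.bam.bai,bai,S1,Bladder Cancer\n"
    "S2.bam,bam,S2,Healthy Control\n"
)
print(build_phenotype_csv(manifest))
```

To use it on the downloaded manifest, read the file's text, pass it to the function, and write the returned string to phenotype_filtered.csv before uploading.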

Step 6: Conduct differential expression with DESeq2

Info

The DESeq2 package can determine differential expression between sample groups from the raw count table produced by HTSeq. It fits a negative binomial generalized linear model for each gene and then uses the Wald test for significance testing. Count outliers are detected using Cook’s distance and can be removed from further analysis. The Wald test p-values from the subset of genes that pass independent filtering can then be adjusted for multiple testing using the Benjamini-Hochberg procedure.
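To make the multiple-testing step concrete, the sketch below implements the standard Benjamini-Hochberg adjustment in plain Python. It illustrates the procedure only and is not DESeq2's own code. The function name and the example p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by n / rank and
    # enforcing that adjusted values never increase as rank decreases.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for illustration.
p = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(p)
print(adj)

# Genes whose adjusted p-value falls below an FDR cutoff of 0.01.
significant = [i for i, q in enumerate(adj) if q < 0.01]
print(significant)
```

Because each p-value is scaled by the number of tests divided by its rank, the adjusted value is never smaller than the raw one. In this toy example no gene passes the 0.01 cutoff, even though one raw p-value is below 0.01.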

Click on Apps from the menu bar
Click on DESeq2
Click on the Run button from the top right-hand side
Under App Settings enter Urinary_Bladder_Cancer_DGE as the Analysis title
Under App Settings enter diagnosis as the Covariate of interest
Under App Settings enter 0.01 as the FDR cutoff
Under App Settings enter Healthy Control as the Factor level - reference
Under App Settings enter Bladder Cancer as the Factor level - test
Under App Settings select htseq from the dropdown menu of Quantification tool
Under App Settings select True from the dropdown menu of log2 fold change shrinkage
Under Inputs click on the Select file(s) button associated with Expression data
- Select the counts.tsv files from the Counts folder
Under Inputs click on the Select file(s) button associated with Gene annotation
- Select canFam3.ncbiRefSeq.gtf
Under Inputs click on the Select file(s) button associated with Phenotype data
- Select phenotype_filtered.csv

Click on the Run button in the upper right-hand corner
When the task has completed successfully, inspect the results, which should match those shown below