Krithika Bhuvaneshwar's Weblog


Jan 15 2015

Empowering Researchers in NGS Data Analysis on the Cloud

Next generation sequencing (NGS) has grown exponentially since 2007, driven by faster, more accurate, and more affordable sequencing. The potential of NGS is now being tapped in a wide variety of applications including re-sequencing, functional genomics, translational research, and clinical genomics, enabling scientists to find new insights in the genome, transcriptome, epigenome, and microbiome. These technologies produce massive amounts of data, and their processing and analysis is non-trivial, requiring a powerful computational infrastructure, high-quality bioinformatics software, and skilled personnel to operate the tools.

Cloud computing has become a natural fit for NGS data processing: it leverages virtualization to provide computational resources on demand and helps utilize those resources more efficiently. Its shared computing environment and pay-as-you-go storage can greatly benefit geographically dispersed teams working on the same data.

As we move into this new era of big data, many basic scientists and genomics core labs currently rely on third-party vendors or bioinformaticians to help process and handle these data. Researchers who prefer to do the processing on their own must learn to run command-line tools; like learning any new language, this poses many challenges, not to mention being time-consuming. Many commercial systems offer solutions through user interfaces, including DNA Nexus, Maverix Biomics, Seven Bridges, and others. The Innovation Center for Biomedical Informatics (ICBI) at Georgetown faced this challenge as well a few years ago; at that time we explored various cloud-based solutions and found the commercial options too expensive for an academic center like ours to adopt. We therefore looked for other options that offered a practical solution to the data management and analysis challenges of NGS data, and found "Globus Genomics" to be a solution that can save significant time and cost.

We chose the "Globus Genomics" system for a case study due to its scalability, breadth of available tools, and user-friendliness at an affordable cost. The Globus Genomics system was developed at the Computation Institute, University of Chicago. ICBI collaborated with the Globus Genomics team on a pilot project to develop and test several NGS workflows, and we summarized our experiences from the case study in a recently published paper.

The "Globus Genomics" system simplifies terabyte-scale data handling and provides advanced tools for NGS data analysis on the cloud. It offers users the capability to process and transfer data easily, reliably, and quickly to address end-to-end NGS analysis requirements. The system is built on Amazon's cloud computing infrastructure and takes advantage of elastic scaling (i.e., increasing and decreasing compute capacity in response to changing demand) to run multiple workflows in parallel, helping meet the scale-out analysis needs of modern translational genomics research. It is offered as a service that eliminates the need to install and maintain the software, and allows users to run high-performance computing (HPC) workflows on the cloud through graphical interfaces, so users don't have to worry about operational complexities.
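The elastic-scaling idea above can be illustrated with a toy sketch: choose a worker-node count from the amount of queued work, clamped to a minimum and maximum pool size. This is purely illustrative (the function, its parameters, and the job-per-node figure are our own assumptions); real cloud autoscaling, as used by Globus Genomics on AWS, is considerably more involved.

```python
# Toy sketch of elastic scaling: size the worker pool from current demand.
# All parameters here are hypothetical, chosen only to illustrate the idea.

def nodes_needed(queued_jobs: int, jobs_per_node: int = 4,
                 min_nodes: int = 1, max_nodes: int = 32) -> int:
    """Scale the pool with demand: ceiling division, then clamp to [min, max]."""
    wanted = -(-queued_jobs // jobs_per_node)  # ceiling division without math.ceil
    return max(min_nodes, min(max_nodes, wanted))

print(nodes_needed(0))    # idle -> shrink back to the minimum pool (1)
print(nodes_needed(21))   # 21 queued jobs at 4 per node -> 6 nodes
print(nodes_needed(500))  # demand beyond the cap -> capped at 32 nodes
```

The clamping is what makes the scaling "elastic" in both directions: capacity grows with the queue but falls back to a small baseline when demand disappears, which is what makes pay-as-you-go pricing attractive.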

In the case study, we presented three NGS workflows to illustrate the data management and sharing capabilities of the Globus Genomics system. The workflows were for whole genome (WGS), whole exome (WES), and whole transcriptome (RNA-seq) sequencing data. They involved medium-scale data processed through the Globus Genomics architecture, providing a fast and scalable solution for pre-processing, analysis, and sharing of large NGS data sets typical of translational genomics projects. The paper also provided guidance to users of NGS analysis software on how to address scalability and reproducibility issues with existing NGS pipelines when dealing with large volumes of data.

The Globus Genomics system allowed efficient data transfer of a large number of samples as a batch. It processed 21 RNA-seq samples in parallel (average compressed input size 13.5 GB per paired-end set) in about 20-22 hours, generating about 3.2 TB of data. It also processed 78 WES samples (average compressed input size 5.5 GB per paired-end set), completing execution in about 12 hours and generating about 3.1 TB of data. These figures will hopefully allow users to roughly predict the time required to process raw data given the workflow and the size of the data. The variant calls or the gene/isoform expression data output from the workflows can be exported from the Globus system and further analyzed at the level of genes, pathways, and biological processes relevant to disease outcome.
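As a back-of-the-envelope illustration of that "roughly predict the time" point, the throughput figures above can be turned into a naive linear estimate. The reference numbers come from the case study; the assumption that wall-clock time scales linearly with total input size for a fixed workflow is ours, and is only a first approximation (real runtimes depend on tool choice, node types, and parallelism):

```python
# Rough runtime estimator from the throughput reported in the case study.
# The linear-scaling assumption is ours, not part of the published paper.

# workflow -> (samples, avg compressed GB per paired-end set, approx hours)
OBSERVED = {
    "RNA-seq": (21, 13.5, 21.0),  # midpoint of the reported 20-22 h range
    "WES":     (78, 5.5, 12.0),
}

def estimate_hours(workflow: str, n_samples: int, gb_per_sample: float) -> float:
    """Scale the observed wall-clock time by total input size (naive linear model)."""
    ref_n, ref_gb, ref_hours = OBSERVED[workflow]
    return ref_hours * (n_samples * gb_per_sample) / (ref_n * ref_gb)

# Example: doubling the RNA-seq batch roughly doubles the estimated time.
print(round(estimate_hours("RNA-seq", 42, 13.5), 1))  # -> 42.0 hours
```

A genomics core could use this kind of estimate only for capacity planning; any serious prediction should be re-calibrated against its own runs.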

At the end of the case study, we found the system to be user-friendly; we believe its user interface is suitable for scientists who don't have programming experience. The system is especially suited for genomics cores that need to process increasing amounts of NGS data in a short amount of time and share the processed results with their respective clients.

We hope that the Globus Genomics system and our case study will empower genetic researchers to re-use well-known, publicly available pipelines or build their own, and to perform rapid analysis of terabyte-scale NGS data using just a web browser, in a fully automated manner with no software installation. The power is now in your hands!

Our case study has enabled an implementation of the Globus Genomics system at the Genomics shared resource at Georgetown. This six-month pilot project, which will begin in early 2015, is a big step for the Georgetown community and will allow for end-to-end processing of NGS data in-house.

ICBI has come a long way in its NGS data processing and analysis capabilities. Apart from the WGS, WES, and RNA-seq pipelines inside the Globus Genomics system, we also have in-house command-line pipelines that have been used in G-DOC Plus, and we are continuing our efforts to improve our standing in the NGS community. If you are interested in partnering with us, feel free to contact us at:


May 16 2014

Highlights from TCGA 3rd Annual Symposium

The Cancer Genome Atlas’ 3rd annual scientific symposium – a report

Earlier this month, I had the opportunity to attend the 3rd annual TCGA symposium at the NIH in Bethesda. The TCGA symposium is an open scientific meeting that invites all scientists who use, or wish to use, TCGA data to share and discuss their novel research findings using these data. Although I am a frequent user of TCGA data, this was my first visit to the symposium, and I was excited to see so many other researchers using these datasets to create new knowledge in cancer research. Here I have highlighted a few talks from the symposium.

Dr. Christopher C. Benz and team studied mutations across 12 different cancer types and found PIK3CA mutations occurring in 8 types of cancer. Their analysis showed that breast and kidney cancers favor kinase-domain mutations that enhance PI3K catalytic activity and drive cell proliferation, while lung and head-and-neck squamous cancers favor helical-domain mutations that preferentially enhance malignant cell motility. It was interesting to see how different pathways are affected based on the domain of the mutation, and such insights could help us understand these mechanisms better.

Samir B. Amin and team profiled long intergenic non-coding RNA (lincRNA) interactions in cancer. Their results showed that cancer samples could be stratified/clustered by cancer type and/or stage based on lincRNA expression data.
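The general idea of stratifying samples by expression profiles can be illustrated with a toy sketch. This is not the authors' method (they worked with real TCGA lincRNA expression matrices); it is a minimal k-means run on synthetic "expression" data for two hypothetical tumor groups, showing how samples with distinct profiles fall into separate clusters:

```python
import numpy as np

# Synthetic "lincRNA expression" for two hypothetical tumor groups.
# Purely illustrative -- not the authors' data or pipeline.
rng = np.random.default_rng(0)
n_genes = 50
group_a = rng.normal(loc=0.0, scale=1.0, size=(20, n_genes))  # e.g. tumor type A
group_b = rng.normal(loc=3.0, scale=1.0, size=(20, n_genes))  # e.g. tumor type B
X = np.vstack([group_a, group_b])

def kmeans(X, k=2, iters=20):
    # deterministic init: first and last sample, one from each synthetic group
    centers = X[[0, -1]].copy()
    for _ in range(iters):
        # assign each sample to its nearest center (squared Euclidean distance)
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(axis=-1), axis=1)
        # recompute each center as the mean of its assigned samples
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = kmeans(X)
# With well-separated groups, the first 20 and last 20 samples end up
# in different clusters, recovering the "tumor type" stratification.
print(labels)
```

In practice one would also select informative lincRNAs, normalize expression values, and use more robust clustering (e.g., hierarchical or consensus clustering), but the sketch captures the core idea of expression-based stratification.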

Another interesting talk was by Dr. Rehan Akbani, whose team profiled proteomics data across multiple cancer types using reverse-phase protein arrays (RPPA), analyzing more than 3,000 patients from 11 TCGA diseases with 181 antibodies that target a panel of known cancer-related proteins. Their findings identified several novel and potentially actionable single-tumor and cross-tumor targets and pathways. Their analyses also showed that tumor samples demonstrate a much more complex regulation of protein expression than cell lines, most likely due to the microenvironment, i.e., stroma-tumor and/or immune cell-tumor interactions.

Gastric cancer (GC) is the third leading cause of cancer death worldwide, after lung and liver cancers. Most clinical trials currently recruit patients with stomach cancer and find that not all patients respond the same way to treatment, implying an underlying heterogeneity in the tumors. Adam Bass's group at the Dana-Farber Cancer Institute performed a comprehensive molecular evaluation of 295 primary gastric adenocarcinomas. Using cluster-of-clusters and iCluster methods, they separated GC into four subtypes:

  1. Tumors positive for Epstein-Barr virus – displaying recurrent PIK3CA mutations and extreme DNA hypermethylation.
  2. Microsatellite unstable tumors – showing elevated mutation rates, including mutations of genes encoding targetable oncogenic signaling proteins.
  3. Genomically stable tumors – enriched for the diffuse histologic variant and mutations of RHOA or fusions involving RHO-family GTPase-activating proteins.
  4. Tumors with chromosomal instability – showing marked aneuploidy and focal amplification of receptor tyrosine kinases.

They also found that tumor characteristics vary by site within the stomach – tumors found in the middle of the stomach are more often EBV-positive and show strong methylation differences. Here's hoping that understanding these tumor subtypes in GC will help develop treatments specific to each subtype and eventually improve gastric cancer survival.

Even though TCGA data analysis is often synonymous with integrative analyses of multi-omics data, it was interesting to see in-depth analyses of single data types – including associations with viral DNA and yeast models, and in-depth analyses of mRNA splicing, splicing mutations, and copy number aberrations. The TCGA data collection has compiled not only multi-omics data for various cancer types, but also imaging and pathology images for many samples that could be used to validate results from 'omics' analyses.

Like a kid in a candy store, I was most surprised and excited to see a number of online portals and freely available software tools showcased in the posters that take advantage of the TCGA big-data collection. Some of them are highlighted below.

Online tools/portals:

  • CRAVAT 3.0 – Predicts the functional effect of variants on their translated proteins, and whether the submitted variants are cancer drivers
  • MAGI – For mutation annotation and gene interpretation
  • SpliceSeq – Allows users to interactively explore splicing variation across TCGA tumor types
  • TCGA Compass – Allows users to explore clinical data, methylation, miRNA and mRNA seq data from TCGA

Downloadable tools from Github/R:

  • THetA – Program for Tumor Heterogeneity Analysis
  • ABRA – Tool for improved indel detection
  • HotNet2 – Algorithm that identifies significantly mutated sub-networks in a protein-protein interaction (PPI) network
  • Switch plus – An R package in the making that uses segmented copy-number data across various cancer types to show differences between human and mouse models

It is energizing to see the collective efforts being taken to make this data collection more readable and parsable. I’m sure the biomedical informatics community will be more than pleased to know that it is becoming easier to explore and find what one is looking for within the TCGA data collection.

Comments by Krithika Bhuvaneshwar with contributions by Dr. Yuriy Gusev
