Jan 15 2015

Empowering Researchers in NGS Data Analysis on the Cloud


Next generation sequencing (NGS) has grown exponentially since 2007 due to faster, more accurate and more affordable sequencing. The potential of NGS is now being tapped in a wide variety of applications including re-sequencing, functional genomics, translational research, and clinical genomics, enabling scientists to find new insights in the genome, transcriptome, epigenome and microbiome. These technologies produce massive amounts of data, and their processing and analysis are non-trivial, requiring a powerful computational infrastructure, high-quality bioinformatics software, and skilled personnel to operate the tools.

Cloud computing has become closely associated with NGS data processing, since it uses virtualization to provide computational resources on demand and helps utilize those resources more efficiently. Its shared computing environment and pay-as-you-go storage can greatly benefit geographically dispersed teams working on the same data.

As we move into this new era of big data, many basic scientists and genomics core labs rely on third-party vendors or bioinformaticians to help with the processing and handling of this data. Researchers who prefer to do the processing on their own must learn to run command-line tools; like learning any new language, this poses many challenges and can be time consuming. Many commercial systems offer solutions through user interfaces, including DNAnexus, Maverix Biomics, Seven Bridges and others. The Innovation Center for Biomedical Informatics (ICBI) at Georgetown faced this challenge as well a few years ago; at that time we explored various cloud-based solutions and found the commercial options too expensive for an academic center like ours to adopt. We therefore looked for other options offering a practical solution to the data management and analysis challenges of NGS data, and found "Globus Genomics" to be a solution that can save significant time and cost.

We chose the Globus Genomics system for a case study because of its scalability, breadth of available tools, and user-friendliness at an affordable cost. The Globus Genomics system was developed at the Computation Institute, University of Chicago. ICBI collaborated with the Globus Genomics team on a pilot project to develop and test several NGS workflows, and we have summarized our experiences from the case study in a recently published paper.

The Globus Genomics system simplifies terabyte-scale data handling and provides advanced tools for NGS data analysis on the cloud. It offers users the capability to process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The system is built on Amazon's cloud computing infrastructure and takes advantage of elastic scaling (i.e., increasing and decreasing compute capacity in response to changing demand) to run multiple workflows in parallel, helping meet the scale-out analysis needs of modern translational genomics research. It is offered as a service, which eliminates the need to install and maintain software and allows users to run high-performance computing (HPC) workflows on the cloud through graphical interfaces, so users don't have to worry about operational complexities.
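The elastic-scaling idea is simple to state in code: watch the workflow queue and resize the worker pool within fixed bounds. The sketch below is purely illustrative of that logic; the per-worker capacity and instance limits are hypothetical, not Globus Genomics internals:

```python
import math

def workers_needed(queued_jobs, jobs_per_worker=4, min_workers=0, max_workers=32):
    """Decide how many worker instances should be running for the current
    queue depth: scale out as jobs pile up, back down toward min_workers
    as the queue drains. Capacity and bounds here are made-up examples."""
    if queued_jobs <= 0:
        return min_workers
    wanted = math.ceil(queued_jobs / jobs_per_worker)
    return max(min_workers, min(wanted, max_workers))
```

A scheduler calling this periodically would launch or terminate instances to match the returned count, which is what keeps costs proportional to actual demand.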

In the case study, we presented three NGS workflows to illustrate the data management and sharing capabilities of the Globus Genomics system. The NGS workflows were for whole genome (WGS), whole exome (WES) and whole transcriptome (RNA-seq) sequencing data. The workflows involved medium-scale data processed through the Globus Genomics architecture, providing a fast and scalable solution for pre-processing, analysis, and sharing of large NGS data sets typical of translational genomics projects. The paper also offered guidance to users of NGS analysis software on addressing the scalability and reproducibility issues of existing NGS pipelines when dealing with large volumes of data.

The Globus Genomics system allowed efficient transfer of a large number of samples as a batch, and was able to process 21 RNA-seq samples in parallel (average input 13.5 GB per compressed paired-end set) in about 20-22 hours, generating about 3.2 TB of data. The system also processed 78 WES samples (average input 5.5 GB per compressed paired-end set) in about 12 hours, generating about 3.1 TB of data. These figures should let users roughly predict the time required to process raw data given the workflow and the size of the data. The variant calls or gene/isoform expression data output from the workflows can be exported from the Globus system and further analyzed at the level of genes, pathways and biological processes relevant to disease outcome.
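As a rough illustration of such predictions, the batch figures above can be turned into a crude estimator. It assumes the same parallel capacity and instance types as our runs, and treats any partial batch as a full one, so read its output as an order-of-magnitude guide rather than a guarantee:

```python
import math

# Observed batch figures from the case study: (samples per batch, wall hours).
OBSERVED = {
    "RNA-seq": (21, 21.0),  # 21 samples in roughly 20-22 hours
    "WES": (78, 12.0),      # 78 samples in about 12 hours
}

def estimate_wall_hours(workflow, n_samples):
    """Crude upper-bound estimate of wall-clock hours for n_samples,
    assuming the same parallel capacity as the observed batch: each
    (possibly partial) batch takes the observed batch wall time."""
    per_batch, hours = OBSERVED[workflow]
    batches = math.ceil(n_samples / per_batch)
    return batches * hours
```

For example, 100 WES samples would need two batches under this model, so roughly 24 hours of wall time.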

At the end of the case study, we found the system to be user friendly; we believe its user interface is suitable for scientists who don't have programming experience. The system is especially suited for genomics cores that need to process increasing amounts of NGS data in a short amount of time and share the processed results with their respective clients.

We hope that the Globus Genomics system and our case study will empower genetics researchers to reuse well-known, publicly available pipelines or build their own, and to perform rapid analysis of terabyte-scale NGS data using just a web browser, in a fully automated manner and with no software installation. The power is now in your hands!

Our case study has enabled an implementation of the Globus Genomics system at the Genomics Shared Resource at Georgetown. This six-month pilot project, which will start at the beginning of 2015, is a big step for the Georgetown community and will allow end-to-end processing of NGS data in-house.

ICBI has come a long way in its NGS data processing and analysis capabilities. Apart from the WGS, WES and RNA-seq pipelines inside the Globus Genomics system, we also have in-house command-line pipelines that have been used in G-DOC Plus, and we are continuing our efforts to improve our standing in the NGS community. If you are interested in partnering with us, feel free to contact us at icbi@georgetown.edu.


Jan 12 2014

Genomes on Cloud 9


Genome sequencing is no longer a luxury available only to large genome centers. Recent advancements in next generation sequencing (NGS) technologies and the reduction in cost per genome have democratized access to these technologies for highly diverse research groups. However, limited access to computational infrastructure, high-quality bioinformatics software, and personnel skilled in operating the tools remains a challenge. A reasonable solution to this challenge is user-friendly software-as-a-service running on cloud infrastructure. There are numerous articles and blogs on the advantages and disadvantages of scientific cloud computing. Without repeating the messages from those articles, here I want to capture the lessons learned from our own experience as a small bioinformatics team supporting the genome analysis needs of a medical center using cloud-based resources.

Why should a scientist care about the cloud?

Reason 1: On-demand computing (such as that offered by cloud resources) can accelerate scientific discovery at low cost. According to Ian Foster, Director of the Computation Institute at the University of Chicago, 42 percent of a federally funded PI's time is spent on the administrative burden of research, including data management. This involves collecting, storing, annotating, indexing, analyzing, sharing and archiving data relevant to their project. At ICBI, we strive to relieve investigators of this data management burden so they can focus on "doing science." The elastic nature of the cloud allows us to invest as much or as little up front in data storage as needed. We work with sequencing vendors to move data directly to the cloud, avoiding damaged hard drives and manual backups. We have taken advantage of Amazon's Glacier storage, which holds less frequently used data at ~10 percent of the cost of regular storage. We have optimized our analysis pipelines to go from raw sequence reads (FASTQ) to BAM to VCF in 30 minutes for exome sequences using a single large compute instance on AWS, with benchmarks of 12 hours and 5 hours per sample for whole genome sequencing and RNA sequencing, respectively.
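The Glacier trade-off is easy to quantify with back-of-envelope arithmetic. The per-GB prices below are hypothetical placeholders (AWS pricing changes over time, so check current rates); only the roughly 10:1 ratio reflects the text above:

```python
# Hypothetical per-GB-month prices -- placeholders, not current AWS rates.
# Only the ~10:1 standard-to-archive ratio mirrors the figure in the text.
STANDARD_PER_GB = 0.030
GLACIER_PER_GB = 0.003

def monthly_storage_cost(hot_gb, cold_gb):
    """Monthly cost with hot_gb in standard storage and cold_gb archived."""
    return hot_gb * STANDARD_PER_GB + cold_gb * GLACIER_PER_GB

# Example: 2 TB of actively used data plus 20 TB of archived raw reads.
tiered = monthly_storage_cost(2 * 1024, 20 * 1024)
all_standard = monthly_storage_cost(22 * 1024, 0)
```

With numbers like these, tiering the archive cuts the monthly bill several-fold, which is why moving infrequently accessed raw reads to cold storage pays off so quickly.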

Reason 2: Most of us are not the Broad, BGI or Sanger, says Chris Dagdigian of BioTeam, who is also a co-founder of the BioPerl project. These large genome centers operate multiple megawatt data centers and have dozens of petabytes of scientific data under their management. The remaining 99 percent of us thankfully deal in much smaller scales of a few thousand terabytes, and thus manage to muddle through using cloud-based or local enterprise IT resources. This model puts datasets such as 1000 Genomes, TCGA, UK10K, etc. at the fingertips (literally a click away) of a lone scientist sitting in front of his/her computer with a web browser. At ICBI we see the cloud as a powerful shared computing environment, especially when groups are geographically dispersed. The cloud environment offers readily available reference genomes, datasets and tools. To our research collaborators, we make available public datasets such as TCGA, dbGaP studies, and NCBI annotations, among others. Scientists no longer need to download, transfer, and organize useful reference datasets to help generate hypotheses specific to their research.

Reason 3: Nothing inspires innovation in the scientific community more than large federal funding opportunities. NIH’s Big Data to Knowledge (BD2K), NCI’s Cancer Cloud Pilot and NSF’s BIG Data Science and Engineering programs are just a few of many programs that support the research community’s innovative and economical uses for the cloud to accelerate scientific discovery. These opportunities will enhance access to data from federally funded projects, innovate to increase compute efficiency and scalability, accelerate bioinformatics tool development, and above all, serve researchers with limited or no high performance computing access.

So, what's the flip side? We have found that scientists must be cautious when selecting the right cloud (or other IT) solution for their needs, and several key factors must be considered. Access to large datasets from the cloud requires adequate network bandwidth to transfer the data. Tools that run well on local computing resources may have to be re-engineered for the cloud. For example, in our own work involving exome and RNA-seq data, we configured Galaxy NGS tools to take advantage of Amazon cloud resources. While economy of scale is touted as an advantage of cloud-based data management solutions, it can actually turn out to be very expensive to pull data out of the cloud. Appropriate security policies need to be put in place, especially when handling patient data on the cloud. Above all, if the larger scientific community is to fully embrace cloud-based tools, cloud projects must be engineered for end users, hiding the complexities of data storage and computation.
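The bandwidth caveat can be made concrete: transfer time grows linearly with data size and inversely with sustained throughput, which in practice falls well below a link's nominal rate. A quick estimator (decimal units; the sustained rate is whatever you actually measure on your link, not the advertised speed):

```python
def transfer_hours(gigabytes, sustained_mbps):
    """Hours to move `gigabytes` of data over a link sustaining
    `sustained_mbps` megabits per second (decimal: 1 GB = 8000 megabits)."""
    return gigabytes * 8000.0 / sustained_mbps / 3600.0
```

For example, pulling a 1 TB dataset over a 1 Gbps link that sustains only 400 Mbps takes about 5.6 hours, before any egress fees, which is why planning transfers (and minimizing data movement out of the cloud) matters.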

My prediction for 2014 is that we will definitely see an increase in biomedical applications of the cloud. This will include usage expansions on both public (e.g. Amazon cloud) and private (e.g. U. Chicago’s Bionimbus) clouds. On that note, I wish you all a very happy new year and happy computing!

Let’s continue the conversation – find me on e-mail at sm696@georgetown.edu or on twitter at @subhamadhavan

Categories: From the director's office, Subha Madhavan