Next generation sequencing (NGS) has grown exponentially since 2007 due to faster, more accurate and more affordable sequencing. The potential of NGS is now being tapped in a wide variety of applications including re-sequencing, functional genomics, translational research, and clinical genomics, enabling scientists to find new insights in the genome, transcriptome, epigenome and microbiome. These technologies produce massive amounts of data, and their processing and analysis are non-trivial, requiring powerful computational infrastructure, high-quality bioinformatics software, and skilled personnel to operate the tools.
Cloud computing has become closely associated with NGS data processing because it uses virtualization to provide computational resources to users on demand and helps utilize those resources more efficiently. Its shared computing environment and pay-as-you-go storage can greatly benefit geographically dispersed teams working on the same data.
As we move into this new era of big data, many basic scientists and genomics core labs rely on third-party vendors or bioinformaticians to help process and handle their data. Many researchers who prefer to do the processing on their own have to learn how to run command-line tools; like learning any new language, this can be challenging and time consuming. Several commercial systems offer solutions through user interfaces, including DNAnexus, Maverix Biomics, Seven Bridges and others. The Innovation Center for Biomedical Informatics (ICBI) at Georgetown faced this challenge a few years ago. At that time we explored various cloud-based solutions and found the commercial options too expensive for an academic center like ours to adopt. We therefore looked for other options that offered a practical solution to the data management and analysis challenges of NGS data, and found “Globus Genomics” to be a solution that can save significant time and cost.
We chose the “Globus Genomics” system for a case study because of its scalability, availability of tools, and user-friendliness at an affordable cost. The Globus Genomics system was developed at the Computation Institute, University of Chicago. ICBI collaborated with the Globus Genomics team on a pilot project to develop and test several NGS workflows, and we have summarized our experiences from the case study in a recently published paper.
The “Globus Genomics” system simplifies terabyte-scale data handling and provides advanced tools for NGS data analysis on the cloud. It lets users process and transfer data easily, reliably and quickly to address end-to-end NGS analysis requirements. The system is built on Amazon’s cloud computing infrastructure and takes advantage of elastic scaling of compute resources (i.e., increasing and decreasing compute capacity in response to changing demand) to run multiple workflows in parallel, which helps meet the scale-out analysis needs of modern translational genomics research. It is offered as a service, which eliminates the need to install and maintain the software, and it allows users to run high-performance computing (HPC) workflows on the cloud through graphical interfaces, so users don’t have to worry about operational complexities.
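To make the idea of elastic scaling concrete, here is a minimal toy sketch in Python. It is not the Globus Genomics scheduler or any Amazon API; the function name `desired_workers`, its parameters, and the cap of 50 workers are all hypothetical, chosen only to illustrate capacity following demand.

```python
import math


def desired_workers(queued_jobs: int,
                    jobs_per_worker: int = 1,
                    min_workers: int = 0,
                    max_workers: int = 50) -> int:
    """Return how many compute workers a toy elastic policy would request.

    Hypothetical policy: one worker per `jobs_per_worker` queued jobs,
    clamped to the range [min_workers, max_workers].
    """
    if queued_jobs < 0:
        raise ValueError("queued_jobs must be non-negative")
    if jobs_per_worker < 1:
        raise ValueError("jobs_per_worker must be at least 1")
    needed = math.ceil(queued_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, needed))


if __name__ == "__main__":
    # Illustrative demand over time: idle, a 21-sample batch, a 78-sample
    # batch, a few stragglers, then idle again.
    demand = [0, 21, 78, 5, 0]
    print([desired_workers(q) for q in demand])
```

In this sketch capacity rises when a batch is submitted and falls back to zero when the queue drains, which is the pay-for-what-you-use behavior the paragraph above describes.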
In the case study, we presented three NGS workflows to illustrate the data management and sharing capabilities of the Globus Genomics system. The NGS workflows were for whole genome (WGS), whole exome (WES) and whole transcriptome (RNA-seq) sequencing data. The workflows processed medium-scale data through the Globus Genomics architecture, which provided a fast and scalable solution for pre-processing, analysis, and sharing of the large NGS data sets typical of translational genomics projects. The paper also provided guidance to users of NGS analysis software on how to address the scalability and reproducibility issues of existing NGS pipelines when dealing with large volumes of data.
The Globus Genomics system allowed efficient batch transfer of a large number of samples. It processed 21 RNA-seq samples in parallel (average input size 13.5 GB per compressed paired-end set) in about 20-22 hours, generating about 3.2 TB of data. It also processed 78 WES samples (average input size 5.5 GB per compressed paired-end set), completing execution in about 12 hours and generating about 3.1 TB of data. We hope these figures allow users to roughly predict the time required to process raw data for a given workflow and data size. The variant calls or the gene/isoform expression data output from the workflows can be exported from the Globus system and further analyzed at the level of genes, pathways and biological processes relevant to disease outcome.
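The figures above can be turned into rough planning numbers. The Python sketch below only restates the reported batch sizes and derives total input volume and the ratio of output to input; the class and function names are illustrative, the conversion assumes decimal units (1 TB = 1000 GB), and the storage estimate assumes output grows linearly with input, which the case study does not claim. It deliberately does not predict run time, since samples ran in parallel and wall-clock time need not scale with sample count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchRun:
    """One batch as reported in the text; field names are illustrative."""
    workflow: str
    samples: int
    avg_input_gb: float  # average compressed paired-end set per sample
    output_tb: float     # approximate total data generated

    @property
    def total_input_gb(self) -> float:
        return self.samples * self.avg_input_gb

    @property
    def output_gb(self) -> float:
        # Assumption: decimal units, 1 TB = 1000 GB.
        return self.output_tb * 1000

    @property
    def expansion_factor(self) -> float:
        """Approximate GB of output generated per GB of compressed input."""
        return self.output_gb / self.total_input_gb


def estimate_output_tb(reference: BatchRun, samples: int,
                       avg_input_gb: float) -> float:
    """Rough storage estimate for a new batch of the same workflow.

    Assumption (not from the case study): output volume scales linearly
    with input volume at the reference batch's expansion factor.
    """
    return samples * avg_input_gb * reference.expansion_factor / 1000


# Values taken directly from the paragraph above.
RNA_SEQ = BatchRun("RNA-seq", samples=21, avg_input_gb=13.5, output_tb=3.2)
WES = BatchRun("WES", samples=78, avg_input_gb=5.5, output_tb=3.1)

if __name__ == "__main__":
    for run in (RNA_SEQ, WES):
        print(f"{run.workflow}: {run.total_input_gb:.1f} GB in, "
              f"{run.output_gb:.0f} GB out, "
              f"about {run.expansion_factor:.1f}x expansion")
```

Under these assumptions the RNA-seq batch generated roughly 11 times its compressed input and the WES batch roughly 7 times, which is a useful reminder to budget storage for intermediate and output files and not only for raw reads.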
At the end of the case study, we found the system to be user friendly; we believe its user interface is suitable for scientists who don’t have programming experience. The system is especially suited for genomics cores that need to process increasing amounts of NGS data in a short amount of time and have to share the processed results with their respective clients.
We hope that the Globus Genomics system and our case study will empower genetic researchers to re-use well-known, publicly available pipelines or build their own, and to perform rapid, fully automated analysis of terabyte-scale NGS data using just a web browser, with no software installation. The power is now in your hands!
Our case study has enabled an implementation of the Globus Genomics system at the Genomics shared resource at Georgetown. This six-month pilot project, which will start at the beginning of 2015, is a big step for the Georgetown community and will allow for end-to-end processing of NGS data in-house.
ICBI has come a long way in its NGS data processing and analysis capabilities. Apart from the WGS, WES and RNA-seq pipelines inside the Globus Genomics system, we also have in-house command-line pipelines that have been used in G-DOC Plus, and we are continuing our efforts to improve our standing in the NGS community. If you are interested in partnering with us, feel free to contact us at: firstname.lastname@example.org.