Sep 16 2014

Biomedical Data Science MeetUp

by at 5:05 pm

We were delighted to have Dr. Warren Kibbe, Director of NCI's Center for Biomedical Informatics and Information Technology (CBIIT), kick off the discussion at ICBI's first MeetUp on Biomedical Data Science in June.  Dr. Kibbe gave a lightning talk about a national learning health system for cancer genomics, in which we can learn from every patient who comes into a doctor's office for treatment.  Although many patients support more data sharing and will consent to their de-identified genomic data being used for research, the effort is still mired in privacy issues, Dr. Kibbe stated.  We need to lower barriers to accessing patient data.  Dr. Kibbe spoke about the HHS Blue Button initiative, which will enable patients to access and download their electronic health record (EHR) data and release their information freely to doctors and others.  He also spoke about the cancer cloud pilot initiative at NCI, in which public data repositories will be co-located with advanced computing resources so that researchers can bring their tools and methods to the data, essentially democratizing access to the troves of data being generated by the scientific community.

Dr. Yuriy Gusev, Sr. Bioinformatics Scientist at ICBI, next discussed large-scale translational genomics research on the cloud in the second lightning talk of the MeetUp. He presented ICBI research utilizing genomics data produced by next-generation sequencing technologies, including whole exome sequencing, whole genome sequencing, RNA-seq, miRNA-seq, and an area we hope to get into in the future – epigenomics.  The projects he discussed involve data from 40 to 2,000 patient samples.  He focused on novel applications of RNA sequencing for disease biomarker discovery and molecular diagnostics, and emphasized the need for scalable platforms such as the cloud computing provided by Amazon Web Services.

The meeting took place at Gordon Biersch in Rockville Town Center, which turned out to be too loud for the discussion but, on the upside, had good beer and provided a nice venue for networking.

If you are in the DC area, please join us for the next MeetUp on September 24 at the Rockville Library (space is limited to the first 50 registrants).  For details visit:

No responses yet | Categories: MeetUp

Sep 13 2013

ICBI Director’s blog post, Fall 2013

by at 4:34 pm

Nate Silver is my new hero. His prediction (well ahead of other political analysts and media outlets) of President Obama's victory in 2012 exemplifies the various facets of data science – data collection, pre-processing, filtering, analyzing, and presenting information – almost in real time.  His simple prediction model and detailed data presentation techniques have inspired and amazed data scientists across domains such as health care, biomedical research, sports analysis, politics, and astronomy, among many others. We clearly live in a data-driven economy. If you haven't gotten enough of the statistics on big data in health care, here are a few more.  U.S. health care data is growing at a rate of 30 petabytes per year. Global health data is estimated at approximately 150 exabytes, growing at 1.2 to 2.4 EB per year. The potential value of health care data, whether through pharmaceutical product development or reimbursement gains, is estimated at $300 billion annually.

So, why should we care about all this? As biomedical researchers, we not only curate big data but also play important roles as analysts, interpreters, and decision makers. As costs of big data generation drop, techniques such as targeted and whole-genome sequencing, RNA-Seq, Chip-Seq, miRNA-Seq and others are proving to be quite useful in the identification of novel and rare anomalies associated with disease, gene expression signatures, and functions of non-coding RNAs in tissue and blood. We will take a deeper dive into one of these techniques – RNA-Seq – and review its data analysis challenges and opportunities.

Although RNA-Seq was developed in 2008, the bioinformatics methods to analyze these data continue to evolve. We have come a long way from testing expression changes in only a few genes with low-throughput techniques such as RT-PCR. Over the past decade, microarrays became the primary high-throughput method for studying gene expression on a genome-wide scale. Yet microarrays have many shortcomings, including the inability to identify novel transcripts, a limited dynamic range of detection, and limited reliability in cross-experiment comparisons. RNA sequencing overcomes many of these problems. High-throughput next-generation sequencing of the entire transcriptome permits both transcript discovery and robust digital quantitation of gene expression levels.

The bioinformatics tools can be categorized based on the applications of RNA-Seq data and the questions we want to ask of the data. Current applications and related tools are listed below for ease of access. Note that tools and software continue to evolve and improve.

  1. Read mapping – Transcriptome sequencing reads are usually first mapped to the genome or to transcriptome sequences, and read alignment is the basic and crucial first step for mapping-based analytical methods. The complexity of genome sequences directly influences the mapping accuracy of short reads: large genomes with repetitive and homologous sequences make short read mapping difficult. Also, as introns and exons vary in length, accurate mapping is necessary to identify true exon boundaries. Tools for read mapping include, among others, Bowtie, BWA, and SOAP2.
  2. Splice junction detection – Alternative splicing is very common in gene transcription in eukaryotes and is essential for genomes to generate diverse RNAs (both protein-coding and non-protein-coding) that ensure proper molecular functioning. The primary challenge RNA splicing poses is correctly mapping sequence reads that span splice junctions to reference sequences. To identify splice junctions between exons, the software must support spliced read mapping, because reads crossing splice junctions need to be split into smaller segments and then mapped to different exons by cross-checking against possible introns. Tools for splice junction detection include, among others, TopHat, MapSplice, and SpliceMap.
  3. Gene and isoform expression testing – With microarrays, we are limited to quantifying expression only at the gene level. By contrast, RNA-Seq can estimate expression at both gene and isoform level. To comprehensively understand the transcriptome, it is important to study expression at the gene isoform level. RNA-Seq can also help detect unannotated genes and isoforms for any species while microarrays depend on prior information from known genes. Tools for genes and isoform quantitation from RNA-Seq include, among others, Cufflinks, MISO, and Scripture.
  4. Differential expression analysis – RNA-Seq can detect both differentially expressed genes and differentially expressed isoforms, while microarrays are limited to differentially expressed genes. Since genes with multiple exons can encode different functional isoforms, this is an important factor to consider when selecting technologies for research. Although it remains more costly to sequence multiple samples than to run microarrays, RNA-Seq will inevitably replace microarrays. While RNA-Seq provides digital counts of genes and isoforms that help quantify expression levels, several RNA-Seq biases should be taken into account, such as sequencing depth, count distribution among samples, and the length of genes and transcripts. Tools for differential expression analysis from RNA-Seq include, among others, Cufflinks, baySeq, and DESeq.
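To make the length and depth biases in item 4 concrete, here is a minimal sketch of within-sample normalization using transcripts per million (TPM). The gene names, counts, and lengths are made up for illustration; production analyses would use a dedicated package (e.g., the DESeq or Cufflinks workflows named above) rather than hand-rolled code like this.

```python
def tpm(counts, lengths_kb):
    """Convert raw read counts to transcripts per million (TPM).

    counts: dict mapping gene -> raw read count
    lengths_kb: dict mapping gene -> transcript length in kilobases
    """
    # Step 1: correct for transcript length (longer transcripts
    # accumulate more reads at the same expression level).
    rpk = {g: counts[g] / lengths_kb[g] for g in counts}
    # Step 2: correct for sequencing depth by scaling so the
    # per-sample values sum to one million.
    scale = sum(rpk.values()) / 1e6
    return {g: v / scale for g, v in rpk.items()}

# Toy example: GENE_B has twice the raw count of GENE_A, but is also
# twice as long, so their TPM values come out equal.
counts = {"GENE_A": 300, "GENE_B": 600, "GENE_C": 100}
lengths_kb = {"GENE_A": 1.5, "GENE_B": 3.0, "GENE_C": 0.5}
expr = tpm(counts, lengths_kb)
```

Because TPM values always sum to one million within a sample, they are comparable across samples with different sequencing depths, which is exactly the count-distribution bias noted above.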

Once we complete pre-processing and gene expression analysis, a number of downstream analyses can follow, depending on the questions we want to answer for a particular dataset. Such analyses may involve functional enrichment, network inference, or integration with other data types, ultimately leading to biological insights and new hypothesis generation. Software tools such as Ingenuity, Partek, and Pathway Studio, among many others, help with downstream analysis of RNA-Seq data. Tool aggregators and workflow developers such as Globus Genomics combine a number of these tools into readily usable data analysis pipelines.
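The functional enrichment step mentioned above usually boils down to a one-sided hypergeometric test: given a list of differentially expressed genes, is a pathway over-represented relative to chance? A minimal sketch, with toy numbers chosen purely for illustration (the commercial tools named above wrap this test, and many refinements, behind their interfaces):

```python
from math import comb

def enrichment_pvalue(N, K, n, k):
    """One-sided hypergeometric tail P(X >= k): the probability of
    seeing at least k pathway genes when drawing n genes at random
    from a universe of N genes, K of which belong to the pathway."""
    denom = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / denom

# Toy numbers: 10,000 genes measured, a 50-gene pathway, 200
# differentially expressed genes, 5 of which fall in the pathway.
# By chance we would expect about 1 hit (200 * 50 / 10,000), so
# observing 5 yields a small p-value.
p = enrichment_pvalue(N=10_000, K=50, n=200, k=5)
```

In practice this test is run once per pathway across a whole database of gene sets, so the resulting p-values must be corrected for multiple testing before calling any pathway enriched.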

In addition to basic science applications, RNA-Seq has the potential to become a clinically applicable technology. In disease classification and diagnosis, RNA-Seq could provide a powerful tool for high-resolution genomic analysis of human tissue samples and cell populations: to identify novel mutations and transcripts in cancers, to classify tumors based on gene expression patterns, or to identify microbial pathogens based on sequence identification. While the sensitivity of this method lends itself nicely to clinical use, challenges associated with small sample sizes, data analysis and interpretation, and education of clinical personnel must be overcome before it can be broadly used in that setting. Still, the day we will routinely use RNA-Seq and similar methods clinically in the practice of precision medicine is not far off. Let's continue the conversation – find me on e-mail at or on Twitter at @subhamadhavan.

No responses yet | Categories: From the director's office, Newsletter, Subha Madhavan