Nate Silver is my new hero. His prediction (well ahead of other political analysts and media outlets) of President Obama’s victory in 2012 exemplifies the various facets of data science – data collection, pre-processing, filtering, analyzing, and presenting information – almost in real-time. His simple prediction model and detailed data presentation techniques have inspired and amazed data scientists across multiple domains such as health care, biomedical research, sports analysis, politics, astronomy, and many others. We clearly live in a data-driven economy. If you haven’t gotten enough of the statistics on big data in health care, here are a few more. U.S. health care data is growing at the rate of 30 petabytes per year. Global health data size is estimated at approximately 150 exabytes, growing at 1.2 to 2.4 EB/year. The potential value of healthcare data, either through pharmaceutical product development dollars or reimbursement gains, is estimated at $300 billion annually.
So, why should we care about all this? As biomedical researchers, we not only curate big data but also play important roles as analysts, interpreters, and decision makers. As costs of big data generation drop, techniques such as targeted and whole-genome sequencing, RNA-Seq, ChIP-Seq, miRNA-Seq and others are proving to be quite useful in the identification of novel and rare anomalies associated with disease, gene expression signatures, and functions of non-coding RNAs in tissue and blood. We will take a deeper dive into one of these techniques – RNA-Seq – and review its data analysis challenges and opportunities.
Although RNA-Seq was developed in 2008, the bioinformatics methods to analyze these data continue to evolve. We have come a long way from testing changes in expression of only a few genes using low-throughput techniques such as RT-PCR. Over the past decade, microarrays have become the primary high-throughput method for studying gene expression on a genome-wide scale. Yet this method has many shortcomings, including the inability to identify novel transcripts, a limited dynamic range for detection, and difficulties with reliability and cross-experiment comparisons. RNA sequencing overcomes many of these problems. High-throughput next-generation sequencing of the entire transcriptome permits both transcript discovery and robust digital quantitation of gene expression levels.
The bioinformatics tools can be categorized based on the applications of RNA-Seq data and the questions we want to ask of the data. Current applications and related tools are listed below for ease of access. Note that tools and software continue to evolve and improve.
- Read mapping – Transcriptome sequencing reads are usually first mapped to the genome or transcriptome sequences, and read alignment is a basic and crucial step for mapping-first analytical methods. The complexity of the genome sequence directly influences the mapping accuracy of short reads. Large genomes with repetitive and homologous sequences make short read mapping difficult. Also, because introns and exons vary in length, accurate mapping is necessary to identify true boundaries. Tools for read mapping include, among others, Bowtie, BWA, and SOAP2.
- Splice junction detection – Alternative splicing is very common in the gene transcriptional process of eukaryotes, and is very important for genomes to generate various RNAs (both protein-coding and non-protein-coding) to ensure proper molecular functioning. Splicing poses the primary challenge in correctly mapping sequence reads that cover splice junctions to reference sequences. To identify the splice junctions between exons, the software must support spliced mapping, because reads that cross a splice junction need to be split into smaller segments and then mapped to different exons by cross-checking against possible introns. Tools for splice junction detection include, among others, TopHat, MapSplice, and SpliceMap.
- Gene and isoform expression testing – With microarrays, we are limited to quantifying expression only at the gene level. By contrast, RNA-Seq can estimate expression at both the gene and isoform levels. To comprehensively understand the transcriptome, it is important to study expression at the gene isoform level. RNA-Seq can also help detect unannotated genes and isoforms for any species, while microarrays depend on prior information from known genes. Tools for gene and isoform quantitation from RNA-Seq include, among others, Cufflinks, MISO, and Scripture.
- Differential expression analysis – RNA-Seq can be used to detect both differentially expressed genes and isoforms, while microarrays are limited to differentially expressed genes. Since genes with multiple exons can encode different functional isoforms, this is an important factor to consider when selecting the proper technologies for research. Although it is still relatively more costly to sequence multiple samples than to run microarrays, RNA-Seq will eventually replace microarrays. While RNA-Seq provides a digital count of genes and isoforms that helps quantify expression levels, several RNA-Seq biases should be taken into account, such as sequencing depth, count distribution among samples, and length of genes and transcripts. Tools for differential expression analysis from RNA-Seq include, among others, Cufflinks, baySeq, and DESeq.
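To make the sequencing depth and gene length biases mentioned above concrete, here is a minimal Python sketch of two widely used length- and depth-normalized expression units, RPKM and TPM. The gene names, read counts, and lengths are made up for illustration, and the function names are hypothetical; this is not how Cufflinks, baySeq, or DESeq are implemented, since dedicated tools apply their own normalization and statistical models.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {
        gene: count / (lengths_bp[gene] / 1e3) / (total_reads / 1e6)
        for gene, count in counts.items()
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    # Reads per base for each gene corrects for gene length.
    rates = {gene: count / lengths_bp[gene] for gene, count in counts.items()}
    # Dividing by the sum of rates corrects for sequencing depth.
    scale = sum(rates.values())
    return {gene: rate / scale * 1e6 for gene, rate in rates.items()}


# Hypothetical toy data: raw read counts and gene lengths in base pairs.
counts = {"geneA": 500, "geneB": 500, "geneC": 1000}
lengths = {"geneA": 1000, "geneB": 2000, "geneC": 4000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)

for gene in counts:
    print(gene, round(rpkm_values[gene]), round(tpm_values[gene]))
```

In this toy data, geneC has twice the raw count of geneA but is four times as long, so after normalization it comes out at half of geneA's expression. Raw counts alone would have suggested the opposite ordering.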
Once we complete pre-processing and gene expression analysis, a number of downstream analyses can follow depending on the questions we want to answer for that particular dataset. Such analyses may involve functional enrichment, network inference, and integration with other data types, which will ultimately lead to biological insights and new hypotheses. Software tools such as Ingenuity, Partek, Pathway Studio and many others help with downstream analysis of RNA-Seq data. Tool aggregators or workflow developers such as Globus Genomics combine a number of these tools into readily usable data analysis pipelines.
In addition to basic science applications, RNA-Seq has the potential to become a clinically applicable technology. In disease classification and diagnosis, RNA-Seq could provide a powerful tool for high-resolution genomic analysis of human tissue samples and cell populations to identify novel mutations and transcripts in cancers, to classify tumors based on gene expression patterns, or to identify microbial pathogens based on sequence identification. While the sensitivity of this method lends itself nicely to clinical use, challenges associated with small sample sizes, data analysis and interpretation, and education of clinical personnel must be overcome before it can be broadly used in that setting. Still, the day when we routinely use RNA-Seq and/or similar methods clinically in the practice of precision medicine is not far off. Let’s continue the conversation – find me by e-mail at email@example.com or on Twitter at @subhamadhavan.