Posted by Nathan Edwards
Talk in the NIH Proteomics Interest Group seminar series on November 9th, 2012.
Georgetown University Medical Center, Washington, DC
Functional enrichment analysis is used extensively in systems biology analyses based on transcriptomics data, linking phenotypically distinct experimental samples via differentially expressed genes to knowledgebases that categorize genes by function, cellular location, domain, or canonical pathway. For a variety of reasons, however, these techniques have not been widely used in proteomics. We seek to address the challenges in applying functional enrichment analysis to proteomic data by ensuring proteins are not double-counted due to shared peptides and by using spectral counting to detect proteins with significantly changed abundance.
We propose more stringent criteria for inferring proteins from bottom-up peptide-fragmentation spectra than traditional parsimony approaches. Parsimony infers all proteins supported by at least one high-confidence peptide identification, while FDR-based filtering permits a controlled but significant number of false identifications in order to boost the number of true identifications. Importantly, peptide-identification false-discovery rates are magnified at the protein level, because false peptide identifications rarely cluster on the same protein. Further, as more sensitive techniques increase the number of peptide identifications at a fixed FDR, the number of false identifications also increases, substantially boosting the number of false proteins. Successful protein inference must therefore be false-discovery-rate aware and must disregard some high-confidence peptide identifications. We propose a number of generalizations of the traditional protein parsimony problem that ensure that each inferred protein is supported by at least two distinct pieces of peptide evidence, and show that these generalizations can be readily solved to optimality.
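To make the two-peptide requirement concrete, here is a minimal sketch of one plausible reading of such a generalization; it is not necessarily the formulation presented in the talk. It assumes that a protein is eligible only if it has at least two distinct identified peptides, that peptides explained by no eligible protein are disregarded, and that the goal is the smallest set of eligible proteins explaining the remaining peptides. The function name, the `min_peptides` parameter, and the toy data are hypothetical, and the exhaustive search is only practical for tiny inputs (a real solver would more likely use integer programming).

```python
from itertools import combinations

def min_cover_two_peptides(protein_peptides, min_peptides=2):
    """Smallest set of eligible proteins that explains every coverable peptide.

    Assumed formulation (illustrative only): a protein is eligible when it
    has at least `min_peptides` distinct peptides.
    """
    # Keep only proteins with enough distinct peptide evidence.
    eligible = {
        prot: set(peps)
        for prot, peps in protein_peptides.items()
        if len(set(peps)) >= min_peptides
    }
    # Peptides that some eligible protein can explain; the rest are disregarded.
    coverable = set().union(*eligible.values())
    names = sorted(eligible)
    # Try subsets in order of increasing size, so the first cover found is minimal.
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            covered = set().union(*(eligible[prot] for prot in subset))
            if covered == coverable:
                return list(subset), coverable
    return names, coverable  # not reached: all eligible proteins always cover

# Hypothetical toy data: P3 is supported by a single peptide only.
toy = {
    "P1": ["pepA", "pepB", "pepC"],
    "P2": ["pepB", "pepC"],
    "P3": ["pepD"],
    "P4": ["pepC", "pepE"],
}
proteins, explained = min_cover_two_peptides(toy)
print(proteins)  # -> ['P1', 'P4']
```

In this toy case pepD is a high-confidence identification that is deliberately left unexplained, because the only protein it supports has no second piece of evidence.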
With a more careful set of inferred proteins in hand, we first consider how to detect differential protein abundance from spectral counts. We consider the application of hypergeometric-based statistical models and Fisher exact tests to protein spectral counts, and explore techniques for correcting these models so that the resulting p-values do not overestimate the statistical significance of the observed counts. We next address the question of an appropriate statistical background for the functional enrichment of proteins, and show that, with these techniques, existing tools, including the widely used DAVID tool, can be applied successfully to functional enrichment analysis of proteomics data. We also show that we can study gene sets directly, finding evidence for differentially abundant gene sets from spectral counts alone and avoiding the perils of counting proteins. Lastly, we explore the use of these spectral-count-based statistical tests for the detection of splicing, even when the observed peptides do not provide evidence of a distinguishing amino-acid sequence.
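As an illustration of the uncorrected starting point, the following minimal sketch applies a two-sided Fisher exact test to one protein's spectral counts in two samples. It assumes the usual 2x2 layout (spectra assigned to the protein versus all other spectra, in sample 1 versus sample 2); the function names and the example counts are hypothetical, and the sketch does not include any of the corrections for overstated significance mentioned above.

```python
from math import comb

def hypergeom_pmf(k, total1, total2, hits):
    """P(exactly k of the protein's `hits` spectra fall in sample 1), margins fixed."""
    return comb(total1, k) * comb(total2, hits - k) / comb(total1 + total2, hits)

def fisher_exact_two_sided(count1, total1, count2, total2):
    """Two-sided Fisher exact p-value for a protein's spectral counts.

    count1, count2: spectra assigned to the protein in each sample.
    total1, total2: all identified spectra in each sample.
    """
    hits = count1 + count2
    lowest = max(0, hits - total2)   # fewest spectra sample 1 could hold
    highest = min(hits, total1)      # most spectra sample 1 could hold
    probs = [hypergeom_pmf(k, total1, total2, hits)
             for k in range(lowest, highest + 1)]
    observed = hypergeom_pmf(count1, total1, total2, hits)
    # Sum every table that is no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical counts: 30 of 10,000 spectra in sample 1 vs 5 of 10,000 in sample 2.
p = fisher_exact_two_sided(30, 10_000, 5, 10_000)
print(f"p = {p:.2e}")
```

Because the test conditions on the total number of spectra in each sample, differences in overall sampling depth between the two runs are accounted for in the margins rather than by a separate normalization step.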