Posted by Nathan Edwards
I gave a talk in the Monday morning “Bioinformatics/Statistics” session at the 7th Annual Conference of US HUPO, presenting the culmination of work by David Retz for his Master's practicum project in the Biostatistics department.
Novel empirical FDR estimation techniques in the PepArML meta-search peptide identification platform
David Retz and Nathan Edwards
The PepArML meta-search peptide identification platform provides a unified search interface to seven search engines; a robust cluster, grid, and cloud computing scheduler for carrying out large-scale target and decoy searches; and an unsupervised model-free machine-learning results combiner, which selects the best peptide identification for each spectrum, estimates false-discovery rates, and formats the unified results in pepXML format. The PepArML meta-search platform typically identifies 2-3 times more spectra than individual search engines at 10% FDR.
The meta-search platform supports Mascot; X!Tandem with native, k-score, and s-score scoring; OMSSA; MyriMatch; and InsPecT with MS-GeneratingFunction spectral probability scores. It reformats spectral data and constructs search configurations for each search engine on the fly. Shared and user-provided compute resources, including clusters, heterogeneous computational grids, and cloud computing, are supported. The scheduler has easily managed hundreds of simultaneous search jobs running on a combination of local cluster and Amazon Web Services cloud computing resources.
The unsupervised model-free machine-learning combiner selects the best peptide identification for each spectrum based on search engine results, plus spectrum, peptide, and sample preparation features. In addition to search results, the machine-learning algorithm is provided features that model digestion, retention time, precursor isotope clusters, mass accuracy, and proteotypic properties. The unsupervised PepArML training heuristic requires no prior knowledge of the performance, utility, or appropriate weighting of these features for application to a particular dataset.
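To make the idea of per-identification features concrete, here is a minimal sketch of two of the feature types mentioned above: a mass accuracy feature (precursor mass error in parts per million) and a digestion feature (missed cleavage count). It assumes tryptic digestion with the common rule of cleaving after K or R unless followed by P. The function names and the choice of features are illustrative assumptions, not PepArML's actual feature definitions.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Precursor mass error in parts per million (signed)."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def missed_cleavages(peptide):
    """Count internal tryptic sites (K or R not followed by P).

    Assumes trypsin; the C-terminal residue is not counted because
    it is the expected cleavage site, not a missed one.
    """
    count = 0
    for i in range(len(peptide) - 1):
        if peptide[i] in "KR" and peptide[i + 1] != "P":
            count += 1
    return count


def example_features(peptide, observed_mass, theoretical_mass):
    """Hypothetical feature record for one peptide identification."""
    return {
        "ppm_error": ppm_error(observed_mass, theoretical_mass),
        "missed_cleavages": missed_cleavages(peptide),
        "length": len(peptide),
    }
```

Features like these would be supplied to the combiner alongside the search engine scores. Because the training heuristic is unsupervised, no weighting of the features has to be chosen up front.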
The search-engine-agnostic false-discovery-rate computation permits apples-to-apples comparison of search engines and FDR estimation techniques. We use this infrastructure to study the effect of reversed and shuffled decoy sequence databases, and of unified, charge-state-partitioned, and peptide-length-partitioned FDR estimation. We also evaluate a variety of non-decoy FDR estimation techniques, demonstrating successful FDR estimation using a novel approach. This novel technique eliminates at least one of the two decoy searches currently required by PepArML, saving considerable compute resources.
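For readers unfamiliar with decoy-based FDR estimation, the following minimal sketch shows reversed and shuffled decoy sequences, a unified FDR estimate, and a partitioned estimate (for example, by charge state). It assumes separate target and decoy searches of equal size, a score where higher is better, and the simple estimator of decoy hits divided by target hits above a threshold. The record layout and function names are hypothetical; this is neither PepArML's implementation nor the novel non-decoy technique from the talk.

```python
import random
from collections import defaultdict


def reversed_decoy(protein_seq):
    """Decoy sequence made by reversing the target protein sequence."""
    return protein_seq[::-1]


def shuffled_decoy(protein_seq, seed=0):
    """Decoy sequence made by shuffling residues (seeded for repeatability)."""
    residues = list(protein_seq)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


def estimate_fdr(psms, threshold):
    """Unified FDR estimate: decoy hits / target hits at or above threshold.

    Each PSM is a dict with 'score' (higher is better) and 'is_decoy'.
    """
    targets = sum(1 for p in psms if p["score"] >= threshold and not p["is_decoy"])
    decoys = sum(1 for p in psms if p["score"] >= threshold and p["is_decoy"])
    if targets == 0:
        return 0.0
    return min(1.0, decoys / targets)


def partitioned_fdr(psms, threshold, key):
    """Estimate FDR separately within each partition, e.g. key='charge'."""
    groups = defaultdict(list)
    for p in psms:
        groups[p[key]].append(p)
    return {k: estimate_fdr(group, threshold) for k, group in groups.items()}
```

Calling `partitioned_fdr(psms, threshold, "charge")` gives one estimate per charge state, while `estimate_fdr(psms, threshold)` pools all identifications. The two can differ substantially when score distributions vary between partitions.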
The slides for this talk can be downloaded from the Talks section of the publications page of the Edwards lab website. In addition, a poster on this research, presented at the RECOMB Satellite Conference on Computational Proteomics in San Diego, can be found in the Posters section of the same page. You can try PepArML here. The novel FDR estimation procedure will be implemented in the public PepArML infrastructure in the next week or so, marked as an experimental feature.