Posted by Nathan Edwards
Georgetown University Medical Center, Washington, DC
Protein inference from bottom-up peptide-fragmentation spectra is a difficult problem due to shared peptides. Parsimony approaches infer all proteins supported by at least one high-confidence peptide identification, while FDR-based filtering permits a controlled but significant number of false identifications in order to boost the number of true identifications. Significantly, peptide identification false-discovery rates are magnified for proteins, as false peptide identifications rarely cluster on the same protein. Further, as more sensitive techniques increase the number of peptide identifications at fixed FDR, the number of false identifications increases, boosting the number of false proteins significantly. Successful protein inference must be false-discovery-rate aware and disregard some high-confidence peptide identifications.
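The magnification effect can be illustrated with simple arithmetic. The sketch below is not taken from this work: it assumes the worst case, in which every false peptide identification lands on its own otherwise unsupported protein, and the function name and all input numbers are hypothetical.

```python
def worst_case_protein_fdr(n_identifications, spectrum_fdr, n_proteins):
    """Upper bound on the protein-level FDR under a worst-case assumption.

    Assumption (illustrative only): each false identification supports a
    distinct protein that has no other supporting identification.
    """
    false_ids = n_identifications * spectrum_fdr
    false_proteins = min(false_ids, n_proteins)
    return false_proteins / n_proteins


# Hypothetical inputs: 10,000 identifications at 1% FDR on 500 proteins.
baseline = worst_case_protein_fdr(10_000, 0.01, 500)

# A more sensitive search doubles the identifications at the same FDR but
# adds only a few true proteins, so the worst-case protein FDR rises.
more_sensitive = worst_case_protein_fdr(20_000, 0.01, 600)

print(f"baseline: {baseline:.3f}, more sensitive: {more_sensitive:.3f}")
```

Under these assumed numbers, a 1% spectrum-level FDR can correspond to a far larger fraction of false proteins, and the bound grows as the number of identifications at fixed FDR grows.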
Traditional parsimony and probabilistic schemes, with or without additional heuristics, provide no guarantees that inferred proteins are supported by more than one significant (unshared) peptide identification and cannot match the number of unassigned peptide identifications to the desired (spectrum-level) FDR. We propose a number of generalizations to the protein parsimony problem that provide such guarantees and show they can be readily solved to optimality.
In one experiment, 162,420 LTQ MS/MS spectra from a yeast cell lysate were searched against the Saccharomyces Genome Database (SGD) using X!Tandem (no refinement) and filtered at 1% FDR, resulting in 34,113 peptide identifications for 4,702 peptides on 1,226 proteins. Traditional parsimony and ProteinProphet inferred 476 and 462 proteins supported by a single peptide, respectively. Eliminating one-hit wonders before traditional parsimony still left 30 single-peptide proteins. Requiring two unshared peptides per protein and maximizing the number of covered identifications, generalized parsimony inferred 620 proteins, 11 more than the other techniques. Furthermore, the specificity afforded by requiring at least two unshared peptides per protein makes it possible to relax the FDR filtering criterion to 3%, with generalized parsimony inferring 687 proteins, each with two unshared peptides, appropriately leaving 3% of the identifications uncovered.