Posted by Nathan Edwards
Georgetown University Medical Center, Washington, DC
Abstract: Protein inference from bottom-up peptide-fragmentation spectra is a difficult problem. Parsimony approaches infer all proteins supported by at least one high-confidence peptide identification, while FDR-based filtering permits a controlled but significant number of false identifications in order to boost true identifications, resulting in undesirable and counter-intuitive consequences. First, peptide identification false-discovery rates are magnified for proteins, as false peptide identifications rarely cluster on the same protein. Second, as more sensitive techniques increase the number of peptide identifications at fixed FDR, the number of false identifications increases, boosting false proteins significantly. Successful protein inference must disregard some high-confidence peptide identifications.
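The magnification effect can be illustrated with a rough back-of-the-envelope model. The sketch below is not part of the reported analysis: it assumes that false peptide identifications land uniformly at random on database proteins, and the function names and all input numbers are hypothetical, chosen only for illustration.

```python
def expected_false_proteins(n_false_peptides: int, n_database_proteins: int) -> float:
    """Expected number of distinct proteins hit by false peptides.

    Assumption (not from the abstract): each false peptide lands on one
    database protein chosen uniformly at random, independently of the others.
    """
    n = n_database_proteins
    return n * (1.0 - (1.0 - 1.0 / n) ** n_false_peptides)


def protein_fdr(n_peptides: int, peptide_fdr: float,
                n_true_proteins: int, n_database_proteins: int) -> float:
    """Approximate protein-level FDR when every supported protein is inferred."""
    false_peptides = round(n_peptides * peptide_fdr)
    false_proteins = expected_false_proteins(false_peptides, n_database_proteins)
    return false_proteins / (false_proteins + n_true_proteins)


# Hypothetical inputs: 1000 peptides at 5% peptide FDR, 100 true proteins,
# and a 10000-protein database.
baseline = protein_fdr(1000, 0.05, 100, 10000)

# A more sensitive search doubles the peptide identifications at the same FDR,
# while the set of true proteins in the sample stays the same.
more_sensitive = protein_fdr(2000, 0.05, 100, 10000)

print(f"protein FDR, baseline:       {baseline:.3f}")
print(f"protein FDR, more sensitive: {more_sensitive:.3f}")
```

Under these assumptions nearly every false peptide lands on a different protein, so the protein-level error rate is well above the 5% peptide-level rate and grows further as the number of peptide identifications increases.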
Unfortunately, filtering one-hit-wonders cannot guarantee that inferred proteins are supported by more than a single unique peptide, nor that a maximum number of peptides or identifications is covered. Furthermore, filtering one-hit-wonders cannot determine the minimum number of proteins that leaves a fixed percentage of peptide identifications unexplained. We propose a number of generalizations to the protein parsimony problem that provide such guarantees and show that the resulting instances can be readily solved to optimality using a variety of preprocessing techniques and a branch-and-bound search strategy.
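One such generalization can be sketched on a toy instance: find the fewest proteins, each with at least k peptides, that leave at most a fixed fraction of peptide identifications unexplained. The code below is a minimal sketch under stated assumptions, not the authors' implementation. It uses exhaustive search over subsets of increasing size rather than preprocessing and branch-and-bound, it interprets "unique peptides" as distinct peptide sequences matched to a protein, and the function name, parameters, and data are all hypothetical.

```python
from itertools import combinations


def generalized_parsimony(peptide_to_proteins, id_counts,
                          min_peptides=2, max_unexplained=0.10):
    """Return a smallest protein list meeting both constraints, or None.

    peptide_to_proteins: peptide -> set of proteins containing it
    id_counts:           peptide -> number of identifications (spectra)
    """
    # Invert the peptide -> proteins map.
    protein_to_peptides = {}
    for pep, prots in peptide_to_proteins.items():
        for prot in prots:
            protein_to_peptides.setdefault(prot, set()).add(pep)

    # Assumption: only proteins with enough distinct peptides may be inferred.
    candidates = sorted(p for p, peps in protein_to_peptides.items()
                        if len(peps) >= min_peptides)

    total = sum(id_counts.values())
    allowed = max_unexplained * total + 1e-9  # tolerance for float rounding

    # Exhaustive search, smallest subsets first, so the first hit is optimal.
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            covered = set().union(*(protein_to_peptides[p] for p in subset))
            explained = sum(id_counts[pep] for pep in covered)
            if total - explained <= allowed:
                return list(subset)
    return None


# Hypothetical toy data: five peptides, three proteins, 20 identifications.
peptide_to_proteins = {
    "pepA": {"P1"}, "pepB": {"P1", "P2"}, "pepC": {"P2"},
    "pepD": {"P3"}, "pepE": {"P1"},
}
id_counts = {"pepA": 6, "pepB": 5, "pepC": 4, "pepD": 1, "pepE": 4}

print(generalized_parsimony(peptide_to_proteins, id_counts))
print(generalized_parsimony(peptide_to_proteins, id_counts, max_unexplained=0.25))
print(generalized_parsimony(peptide_to_proteins, id_counts,
                            min_peptides=1, max_unexplained=0.0))
```

In this toy instance, traditional parsimony (the last call, with `min_peptides=1` and nothing left unexplained) must include P3 to explain a single identification, while the generalized settings ignore that identification and return fewer proteins. Exhaustive search is exponential in the number of candidate proteins and is only practical for very small inputs.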
In one experiment, 92985 LCQ spectra of 18 standard proteins and 15 known contaminants were searched and filtered at 10% FDR, yielding 26065 peptide identifications representing 2531 peptides on 4927 SwissProt protein accessions. Traditional parsimony inferred 1118 proteins, with 1009 supported by a single peptide. Eliminating one-hit-wonders before parsimony inferred 208 proteins, with 100 supported by a single peptide. Eliminating single-peptide proteins after either analysis inferred 108 proteins. The generalized parsimony solution that leaves 10% of peptide identifications unexplained and requires at least two unique peptides per inferred protein needs just 43 proteins, each supported by 3 or more unique peptides. Generalized protein parsimony reduces false protein identifications by optimally choosing identifications to ignore.