PepArML Meta-Search Tutorial and Worked Example


    This post provides a tutorial and worked example in the use of the PepArML meta-search engine for peptide identification from tandem mass spectra using sequence database search.

    In order to follow along with the tutorial, you’ll need to register and log in to PepArML.

    We’ll be analyzing a small set of LC-MS/MS spectra, generated using a Micromass QTOF Ultima mass spectrometer and available at one time from the Sashimi data repository, called 17mix_test2.mzXML.gz. This dataset is referred to as S17 in the PepArML manuscript [1] and in posters and talks available from the Edwards Lab publications page. To download, right-click on the 17mix_test2.mzXML.gz hyperlink, select “Save Link As…” (Firefox) or “Save Target As…” (Internet Explorer), and save it somewhere convenient, such as “Desktop” or “My Documents”. It should be about 12.2 MB in size on your computer. If you have trouble downloading a complete version to your computer, try this link: 17mix_test2.mzXML.gz.

    The basic steps of this tutorial will follow the path of a normal PepArML analysis: upload the spectra to be analyzed, configure the sequence database search parameters, monitor the progress of the jobs, configure the PepArML combiner, and interpret the results. We elaborate on each of these steps.

    Upload spectra for analysis

    Spectra are uploaded to the spectra repository, where the spectra to be analyzed by PepArML are stored for analysis.

    1. Log into PepArML and select the Spectra tab.
    2. Create a folder for the study that generated the spectra. Type the study name, we’ll use “Tutorial”, in the New Folder text entry box, and click the Create button.
    3. Navigate into the new Tutorial folder by clicking on the hyperlink Tutorial. The header should show the folder location and name as /users/<username>/Tutorial, where <username> is your PepArML User Name.
    4. Upload the 17mix_test2.mzXML.gz spectra datafile to the Tutorial folder. Click the Browse button in the Spectra Upload dialog and select the 17mix_test2.mzXML.gz file from your local computer, then click the Upload button. Once uploaded, the line corresponding to 17mix_test will show its size (12 MB) and the number of MS/MS spectra (1389) contained in the file. PepArML can handle most “open” spectral datafile formats in common use for peptide identification, including mzXML, mzData, and mgf, and can work natively with uncompressed, gzip (extension .gz), or bzip2 (extension .bz2) compressed files.
    5. If the spectrum file does not appear, left-click anywhere on the first row of the table with name “..” (but not on the name “..” itself), and select Reload from the popup menu.
    6. Navigate to your home folder /users/<username> by clicking on the “..” hyperlink or the Spectra tab.

    Configure the sequence database search

    We will conduct a semi-tryptic search against the SwissProt protein sequence database using five search engines: KScore (X!Tandem with the K-Score scoring plugin), Mascot, Myrimatch, OMSSA, and Tandem (X!Tandem with native scoring).

    1. Select the Tutorial study’s spectral files for search. Left-click on the row corresponding to the Tutorial study’s folder, but not on the Tutorial hyperlink, and select Search from the popup menu.
    2. You will be taken to the Search tab with the Spectra field pre-populated with the value /users/<username>/Tutorial, which specifies that all spectra files in the Tutorial folder should be searched.
    3. All the search engines’ check-boxes will be checked. We will leave all of these check-boxes in their current checked state.
    4. Select the appropriate instrument name/type from the Instrument options. For the 17mix_test2 dataset, the Waters Q-Tof instrument setting is appropriate, so select it.
    5. Select the appropriate proteolytic agent from the Proteolytic Agent options. The default selection, Trypsin, is appropriate for many proteomics experiments, including the 17mix_test2 dataset, so we leave this option at its default value.
    6. Select the appropriate fixed post-translational modifications (or mass modifications) representing known deterministic mass modifications due to sample preparation effects. The default selection, Carbamidomethyl (C), is appropriate for many proteomics experiments, including the 17mix_test2 dataset, in which iodoacetamide is used to alkylate cysteines after they are reduced, so we leave these selections at their default values.
    7. Select the appropriate variable post-translational modifications (or mass modifications) representing potential mass modifications due to sample handling or biological pathways. The default selections, MetOx (M) (for methionine oxidation), Gln->pyro-Glu (N-term Q), Glu->pyro-Glu (N-term E), Pyro-carbamidomethyl (N-term C), are appropriate for many datasets, including the 17mix_test2 dataset. The consideration of the variable “pyro” artifactual mass modifications is hard-coded in X!Tandem, so including these modifications helps to achieve consensus peptide identifications for these peptides.
    8. Select the appropriate sequence database to search. The 17mix_test2 dataset contains proteins from a variety of organisms, so we select the SwissProt protein sequence database.
    9. Select the appropriate Peptide Candidate criteria. The three options, Specific, Semispecific, and Nonspecific, all use a precursor mass tolerance of 2 Da; the precursor charge state declared in the spectrum file, if present, guessing +1 or +2/+3 if not; and at most 1 missed cleavage. They differ in requiring peptides with two, one, or no termini, respectively, consistent with the selected proteolytic agent. For the 17mix_test2 dataset, we will use the Semispecific option in the Peptide Candidate Selection field.
    10. The Spectra field should already be populated. You can enter a folder or spectra name here too, and the field will show valid completions from the Spectra repository. The value in this field will generally start with /users/<username>.
    11. The Search Number field can be left at the default value of 1. If additional searches, using different search parameters, were carried out against the same spectral datafiles, these should have a different Search Number value.
    12. The Decoy Replicates selection can be left at its default value of 2. PepArML requires at least 2 (shuffled) decoy replicates for each search. If additional decoy replicates are selected, this will increase the precision of the FDR estimation process, but require a significant number of additional searches be carried out.
    13. The Search Chunk Size selection should be set to the value 200. Ideally, each search chunk should take about 10 minutes to complete. While this is difficult to estimate ahead of time, 200 seems to be a good number for semi-tryptic searches of SwissProt, while 500 seems to be a good number for tryptic searches of SwissProt.
    14. All done! Click the Submit Query button. If there are errors, the form will show the problem in red next to the problematic value. If there are no errors, the form will return with no red indicators, filled in for your next submission.
    15. Click the Queue tab to observe the growing queue of waiting search jobs. If no jobs appear after a minute or so, use the back button to check for and correct any errors in your search parameters.
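    The Specific, Semispecific, and Nonspecific options in step 9 boil down to counting how many peptide termini are consistent with the protease. The following is a minimal sketch of that count, assuming standard trypsin rules (cleavage after K or R, but not before P); it illustrates the idea only and is not PepArML’s code:

```python
# Sketch: classify a peptide's digest specificity the way the Specific /
# Semispecific / Nonspecific options do, assuming trypsin rules
# (cleavage after K or R, but not before P). Illustrative only.
def tryptic_termini(protein, pep_start, pep_end):
    """Count peptide termini consistent with trypsin.

    pep_start/pep_end are 0-based, end-exclusive positions in protein."""
    n_ok = (pep_start == 0 or
            (protein[pep_start - 1] in "KR" and protein[pep_start] != "P"))
    c_ok = (pep_end == len(protein) or
            (protein[pep_end - 1] in "KR" and protein[pep_end] != "P"))
    return int(n_ok) + int(c_ok)

def specificity(n_termini):
    """Map a termini count to the corresponding Peptide Candidate option."""
    return {2: "Specific", 1: "Semispecific", 0: "Nonspecific"}[n_termini]
```

    A Semispecific search therefore considers every peptide with at least one consistent terminus, which is why it examines many more candidates than a fully tryptic (Specific) search.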

    Monitor the search jobs

    Search jobs are pulled down from the PepArML scheduler by worker computers running both locally on the Edwards lab cluster and remotely on resources such as the Purdue TeraGrid Condor pool. The status of search jobs can be monitored using the Queued (jobs waiting to run), Running (jobs in progress), Error (failed or crashed jobs), and Done (finished jobs) tabs.

    When there are no search jobs to be run, the worker computers check infrequently (about every 10 minutes or so) to see if new work is available, so it can take a few minutes before any of your jobs start to run. Once there is work available, the worker computers will gradually join the effort, until all the available compute resources are involved. You can observe this gradual ramp-up under the Running tab. Note, however, that compute resources are shared amongst all users with queued jobs, so your search jobs may not run on all of the available machines. You can see a variety of job characteristics by clicking on the job id number hyperlink, including the search configuration for that search job. Finally, many additional fields (columns) associated with the search jobs, as well as sorting and filtering criteria, can be applied by left-clicking on the column headers and selecting appropriate options from the pop-up menu.

    Jobs run in a heterogeneous grid of computation resources can fail for a variety of reasons, including inability to access networked filesystems, lack of disk space, and pre-emption by higher-priority users. Failed jobs are listed under the Error tab, marked Error, and where possible the error message generated is shown. Usually the error is due to some problem with the worker computer, and the jobs merely need to be re-queued to run again. Running jobs contact the scheduler every minute or so, and a “Heartbeat” is recorded as part of the job’s status. Running jobs that vanish, with no heartbeat for more than 10 minutes, are marked Crashed. These jobs should also be re-queued. Under the Error tab, jobs can be re-queued by left-clicking anywhere on any row of the table (except for the id number hyperlink) and selecting Requeue Page or Bulk Requeue from the pop-up menu. Requeued jobs are usually picked up by different worker computers and complete successfully. Jobs that fail repeatedly may have a bad configuration or may crash a particular search engine deterministically.

    You can check the progress of all search jobs by checking the Results tab. All search job results are created as an empty file that is overwritten when the results become available. Each file is marked 1 (empty) or 0 (not empty) in the Empty column, while each folder shows the number of empty results it contains. Once a folder shows 0 empty results, all search jobs are complete. All tabs (except for the Home and User tabs) refresh every thirty seconds, ensuring that the displayed data is up-to-date. You can navigate into folders by clicking on the hyperlinked name, as for the Spectra repository. You can sort and filter the results in a folder by left-clicking on the column headers and selecting appropriate options from the pop-up menu. A left-click on the Empty column header, followed by selecting Filter and True, will filter the results to show only the empty ones – this is a good way to see whether problematic or slow search jobs are related to one search engine or spectrum file.

    Once all of the search jobs are complete, the results can be reconciled and merged using the PepArML result combiner.

    Configure the PepArML Combiner

    The PepArML result combiner will merge all of the search results together, use consensus to improve peptide identification sensitivity, and estimate the statistical significance of the end-result using the decoy searches.
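    The decoy-based significance estimate follows the target-decoy principle: the number of decoy hits above a score threshold, scaled down by the number of shuffled decoy replicates, approximates the number of false positives among the target hits. A minimal sketch of the idea (the real combiner’s scoring and accounting are more involved):

```python
# Sketch of target-decoy FDR estimation; higher score = better match.
# This illustrates the principle only, not PepArML's actual accounting.
def estimated_fdr(target_scores, decoy_scores, n_decoy_replicates, threshold):
    """Estimate the FDR among target PSMs scoring at or above threshold."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    if targets == 0:
        return 0.0
    # Each spectrum was searched against n_decoy_replicates shuffled decoys,
    # so scale the decoy count down before comparing with the target count.
    return (decoys / n_decoy_replicates) / targets
```

    This also explains the Decoy Replicates trade-off from the search configuration: more replicates make the scaled decoy count, and hence the FDR estimate, less noisy, at the cost of additional searches.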

    1. Select the Tutorial study’s search results for combining. Left-click on the row corresponding to the Tutorial study’s folder under the Results tab, and select Combine from the pop-up menu.
    2. You will be taken to the Combine tab with the Results Name field pre-populated with the value /users/<username>/Tutorial, which specifies that all search results in the Tutorial folder should be combined.
    3. Select the Results Number. This specifies the search results number (1, 2, etc…) whose results should be combined. The default Results Number corresponds to the minimum Results Number available for the selected folder, and since we’ve only done one batch of searches for this tutorial, the selection can be left as is.
    4. Specify which search engine’s results should be combined. You can select any combination of the search engines that have computed search results, but typically all available engines’ results would be combined. Since we have search results from all the listed search engines we leave all the check-boxes checked.
    5. The Sequence Database field should be set to the same value used for the search jobs, specifically SwissProt for this tutorial.
    6. The Combiner Heuristic determines the result combining techniques to be used. The dominant compute effort is in the PepArML combiner, so the others take little additional time to compute and provide a good yard-stick to measure the PepArML results by. Typically all Combiner Heuristic check-boxes are left checked.
    7. PepArML allows multiple combining runs (named for the result folder) to exist in the Results repository at the same time. This can be left at the default value 1, which is the next unused Result Number.
    8. All done! Click the Submit Query button. If there are errors, the form will show the problem in red next to the problematic value. If there are no errors, the form will return with no red indicators, filled in for your next submission.
    9. Click the Queue tab to observe the waiting PepArML combiner job. If the job does not appear under the Queue or Running tab after a minute or so, use the back button to check for and correct any errors in your search parameters. Note that PepArML combiner jobs take preference over search jobs, so the jobs may start almost immediately.
    10. The job monitoring tabs and the Results tab can be used to monitor the PepArML combiner job. Once complete, the PepArML result file can be downloaded and interpreted.

    PepArML results interpretation

    Click on the PepArML results hyperlink (Tutorial under the Results tab) to download the Tutorial.peparml.0.zip file containing the different combining technique results and a summary spreadsheet. An example of the Tutorial results is also available at the link Tutorial.peparml.0.zip. The summary spreadsheet (stats.csv) shows the number of spectra each technique (E-values, FDR Heuristic, and PepArML) was able to assign at 10% est. FDR.

    There is sometimes a little variation in these numbers due to the various sources of randomness in the PepArML machine-learning combiner, but the values in the stats.csv file should look something like the following table.

    Combiner    Spectra at 10% est. FDR
    kscore      192
    mascot      194
    myrimatch    95
    omssa       121
    tandem      177
    heuristic   217
    peparml     350

    The performance of the various techniques can also be visualized as a curve, with est. FDR on the x-axis and the number of spectra (or distinct peptides) assigned at that threshold on the y-axis.
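    Such a curve can be recomputed directly from the estfdr column of any <name>-pred-efdr.csv file. A small sketch, taking the estfdr values as a plain list and leaving CSV parsing and plotting aside:

```python
# Sketch: number of PSMs accepted at each est. FDR threshold, i.e. the
# y-values of the curve described above, from a list of estfdr values.
def fdr_curve(estfdr_values, thresholds):
    """Return (threshold, count) pairs: PSMs at that est. FDR or better."""
    estfdrs = sorted(estfdr_values)
    curve = []
    i = 0
    for t in sorted(thresholds):
        # Advance past every PSM whose estimated FDR is within the threshold.
        while i < len(estfdrs) and estfdrs[i] <= t:
            i += 1
        curve.append((t, i))
    return curve
```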


    For each technique, a <name>-pred-efdr.csv and a <name>-prot.csv file is computed. These files can be loaded into Excel just by double-clicking on them. The <name>-pred-efdr file lists the peptide-spectrum assignments (PSMs), one per spectrum. The first few columns describe the spectrum (spectra_set, start_scan, end_scan), the peptide (peptide sequence, mods), and the protein accessions. The next columns list the various features extracted from each search engine. <feat>-xn indicates a feature (such as eval) in search engine x, instance n. Search engine keys: Mascot: m, Tandem: t, OMSSA: o, MyriMatch: m, KScore: k. The sentinel feature is a 0/1 feature indicating whether the search engine’s features are present or absent for a particular spectrum-peptide pair. Missing values are indicated by a “-100”. The decoyhits feature supports the FDR Heuristic and the initial heuristic PepArML protein selection. Following the search engine features are the features available for all PSMs, such as precursor, charge state, peptide length, number of agreeing search engines, tryptic termini features, and min decoy hits.
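    When loading these files programmatically rather than in Excel, it helps to turn the “-100” sentinel into a proper missing value first. A sketch using Python’s csv module; the column names in the sample are illustrative, not the exact PepArML header:

```python
import csv, io

def clean_row(row):
    """Replace the '-100' missing-value sentinel with None."""
    return {k: (None if v == "-100" else v) for k, v in row.items()}

# Illustrative two-feature sample, not a real PepArML header.
sample = io.StringIO("start_scan,eval-k1,eval-o1\n100,0.002,-100\n")
rows = [clean_row(r) for r in csv.DictReader(sample)]
```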

    These result files can be readily manipulated by loading them into Excel, selecting all the data (Ctrl-A), and using the AutoFilter feature (under Data, Filter) to slice and dice the spreadsheet. The last column, estfdr, stands for estimated false discovery rate. For each PSM, this value provides the estimated FDR for all peptide identifications at this prediction confidence or E-value or better. The best peptide identifications can be found by using the AutoFilter pull-down menu on the estfdr column and choosing Sort Ascending, or (Custom…) -> less than or equal to 0.1.

    The <name>-prot.csv file is similar, but grouped by protein. Here, proteins with at least 2 distinct peptides at 10% FDR are selected, and the peptide assignments are grouped with their proteins. Protein statistics include the number of distinct peptides, the number of non-overlapping peptides, and % coverage, for the peptide identifications at 10% FDR only. All peptide-spectrum assignments are retained and shown underneath their proteins, regardless of their FDR. Columns are as for the <name>-pred-efdr.csv file, with a few extra details. First, proteins with _exactly_ the same peptide hits (less than 10% FDR and more than 10% FDR) are collapsed together (indicated by “=>”). Proteins with a strict peptide subset are collapsed and indicated by “->”. The primary protein is indicated by “>>”. Proteins whose peptides (strong and weak identifications) are not a subset get a separate entry. Peptides are marked, in the “first” column, with a “*” in the first (reading from the top) protein cluster they appear in; otherwise this field is left blank (similar to Mascot’s bold). Peptide start and end positions, relative to the cluster’s primary protein entry, are also shown here, and each protein’s peptides are sorted by position.
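    The “=>” collapsing of proteins with exactly the same peptide hits amounts to grouping proteins on their peptide sets. A sketch of that grouping; subset collapsing (“->”) is omitted for brevity, and this illustrates the idea only, not PepArML’s implementation:

```python
from collections import defaultdict

# Sketch: merge proteins with exactly the same peptide hits into one
# cluster (the "=>" case described above). Subset handling is omitted.
def collapse_identical(protein_peptides):
    """protein_peptides: dict of protein accession -> set of peptides.

    Returns clusters (sorted lists) of proteins with identical peptide sets."""
    clusters = defaultdict(list)
    for prot, peps in protein_peptides.items():
        # frozenset of peptides is hashable, so it can key the cluster dict.
        clusters[frozenset(peps)].append(prot)
    return [sorted(members) for members in clusters.values()]
```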

    You can select the header row and peptide-spectrum assignment rows of a single protein’s peptide identifications and AutoFilter on these rows alone – this is an easy way to filter out peptide assignments that have poor scores or significances in a single protein cluster.

    Significant peptide assignments may be omitted from the protein report if none of the proteins containing them have two or more distinct peptides at 10% FDR.


    This tutorial and worked example has demonstrated most of the commonly used features of the PepArML meta-search engine. Please comment, or send me email, if you discover errors in the prose or details of the tutorial, or if you have suggestions on how to make the tutorial more useful.


    [1] N. Edwards, X. Wu, and C.-W. Tseng. “An Unsupervised, Model-Free, Machine-Learning Combiner for Peptide Identifications from Tandem Mass Spectra.” Clinical Proteomics 5.1 (2009): 23-36.
