Inspect: A Proteomics Search Toolkit

Copyright 2007, The Regents of the University of California

Table of Contents

  • Overview
  • Copyright information
  • Installation
  • Database
  • Searching
  • Analysis
  • Basic Tutorial
  • Advanced Tutorial
  • Unrestricted Search Tutorial

    Overview

    Inspect requires a database (a file of protein sequences) in order to interpret spectra. You can specify one or more databases in the Inspect input file. Databases can be stored in one of two formats: a .trie file (a bare-bones format with sequence data only) or a .ms2db file (a simple XML format with exon linkage information). Both formats are discussed below.

    Sequence Databases (FASTA)

    For efficiency, Inspect processes FASTA files into its own internal format before searching. A database is stored as two files: one with the extension ".trie" (which holds peptide sequences) and one with the extension ".index" (which holds protein names and other metadata). To prepare the database, first copy the protein sequences of interest into a FASTA file in the Database subdirectory. Then, from the Inspect directory, run the Python script PrepDB.py as follows:
        python PrepDB.py FASTA MyStuff.fasta
    Replace "MyStuff.fasta" with the name of your FASTA database. After PrepDB.py has run, the database files MyStuff.trie and MyStuff.index will be ready to search. PrepDB.py also accepts Swiss-Prot ".dat" files as input.
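
    If several FASTA files need to be prepared, the command above can be wrapped in a small driver script. The following is a minimal sketch, not part of Inspect: the helper names (prepdb_command, expected_output_names, prepare_all) are hypothetical, and it assumes that "python" on your PATH is an interpreter compatible with PrepDB.py and that the script is started from the Inspect directory.

```python
import subprocess
from pathlib import Path


def prepdb_command(fasta_name, python_exe="python"):
    """Build the PrepDB.py command line shown above for one FASTA file."""
    return [python_exe, "PrepDB.py", "FASTA", str(fasta_name)]


def expected_output_names(fasta_name):
    """Names of the .trie and .index files that PrepDB.py is described as producing."""
    stem = Path(fasta_name).stem
    return stem + ".trie", stem + ".index"


def prepare_all(fasta_names, inspect_dir="."):
    """Run PrepDB.py once per FASTA file, from the Inspect directory (assumed layout)."""
    for fasta_name in fasta_names:
        # check=True stops the batch as soon as one conversion fails.
        subprocess.run(prepdb_command(fasta_name), cwd=inspect_dir, check=True)
        trie_name, index_name = expected_output_names(fasta_name)
        print(f"{fasta_name}: expecting {trie_name} and {index_name}")
```

    For example, prepare_all(["MyStuff.fasta", "Other.fasta"]) would run the documented command twice, once per file.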

    Inspect can also perform this processing automatically (see the "SequenceFile" option in the searching documentation). However, running PrepDB.py is the preferred method, since it creates a database file that can be reused by many searches.

    Note: The database should include all proteins known to be in the sample; otherwise, some spectra will receive incorrect (and possibly misleading) annotations. In particular, most databases should include trypsin (used to digest proteins) and human keratins (introduced during sample processing). The file "CommonContaminants.fasta", in the Inspect directory, contains several protein sequences you can append to your database.
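
    Appending the contaminant sequences amounts to concatenating two FASTA files before running PrepDB.py. The sketch below shows one way to do this; the function names and the output file name are illustrative assumptions, not part of Inspect.

```python
def append_fasta(database_fasta, contaminants_fasta, output_fasta):
    """Write the database records followed by the contaminant records."""
    with open(output_fasta, "w") as out:
        for source in (database_fasta, contaminants_fasta):
            with open(source) as handle:
                text = handle.read()
            out.write(text)
            # A missing final newline would glue the next header onto a sequence line.
            if text and not text.endswith("\n"):
                out.write("\n")


def count_records(fasta_path):
    """Count FASTA records by counting header lines (those starting with '>')."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

    A call such as append_fasta("MyStuff.fasta", "CommonContaminants.fasta", "MyStuffPlus.fasta") produces a combined file (the name "MyStuffPlus.fasta" is hypothetical), and count_records gives a quick check that no records were lost.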

    Decoy records (ShuffleDB)

    Databases including "decoy proteins" (shuffled or reversed sequences) are emerging as the gold standard for computing false discovery rates. Inspect can compute p-values in two ways:
  • Compute the empirical false discovery rate by counting the number of hits to invalid proteins. This is the recommended method. Given an F-score cutoff, Inspect computes the number of shuffled-protein hits above that threshold; these hits are all invalid. Inspect then estimates the number of invalid hits which happen to fall within valid proteins. This count provides an empirical false discovery rate, which is reported as the "p-value".
  • Fit the distribution of F-scores as a mixture model, in the manner of PeptideProphet. This is how the initial p-values output by Inspect are computed. Use PValue.py without the "-S" option to compute p-values using this method.
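
    The empirical method can be illustrated with a short calculation. This is a sketch of the standard target-decoy estimate, not the exact formula used by PValue.py, which is not shown in this document. It assumes that decoy protein names start with the "XXX" flag described below, that each hit is an (F-score, protein name) pair, and that the hypothetical parameter decoy_to_valid_ratio gives the size of the decoy portion of the database relative to the valid portion (1.0 for equal sizes).

```python
def empirical_fdr(hits, cutoff, decoy_prefix="XXX", decoy_to_valid_ratio=1.0):
    """Estimate the false discovery rate among valid-protein hits at a score cutoff.

    hits: iterable of (f_score, protein_name) pairs.
    """
    decoy_hits = 0
    valid_hits = 0
    for f_score, protein_name in hits:
        if f_score < cutoff:
            continue  # only hits at or above the cutoff are counted
        if protein_name.startswith(decoy_prefix):
            decoy_hits += 1  # hits to shuffled proteins are all invalid
        else:
            valid_hits += 1
    if valid_hits == 0:
        return 0.0
    # Invalid hits are assumed to land on valid proteins at the same rate as
    # on decoys, scaled by the relative database sizes.
    estimated_invalid = decoy_hits / decoy_to_valid_ratio
    return min(1.0, estimated_invalid / valid_hits)


example_hits = [
    (3.2, "P1"), (2.9, "P2"), (2.5, "XXXP9"),
    (2.1, "P3"), (1.0, "XXXP4"), (0.5, "P5"),
]
print(empirical_fdr(example_hits, cutoff=2.0))
```

    With a cutoff of 2.0, the example has one decoy hit and three valid-protein hits, so the estimate is one invalid hit among three, or about 0.33.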

    To compute empirical false discovery rates:
  • Use the script ShuffleDB.py to append decoy records to a database before searching. Decoy records have the flag "XXX" prefixed to their name.
  • After searching, use the script PValue.py (including the "-S" option) to carry out this analysis.
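
    To make the idea of decoy records concrete, the sketch below appends one shuffled copy of each protein, with "XXX" prefixed to its name. It is illustrative only: the actual shuffling strategy and options of ShuffleDB.py are not described in this document, and the function name and seed parameter are assumptions.

```python
import random


def append_decoys(records, seed=0):
    """Return the original (name, sequence) records followed by shuffled decoys."""
    rng = random.Random(seed)  # fixed seed so the decoy database is reproducible
    originals = list(records)
    decoys = []
    for name, sequence in originals:
        residues = list(sequence)
        rng.shuffle(residues)  # keeps length and amino acid composition
        decoys.append(("XXX" + name, "".join(residues)))
    return originals + decoys


for name, sequence in append_decoys([("P1", "MKTAYIAKQR"), ("P2", "GGSLLKDE")]):
    print(">" + name)
    print(sequence)
```

    Shuffling preserves each protein's length and composition, so decoy peptides resemble real ones in mass while being very unlikely to match a true sequence.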

    MS2DB Format

    The MS2DB file format is a simple, extensible XML format for storing proteins. The main benefits of using MS2DB format instead of FASTA files are:
  • Reduced redundancy - Each exon is stored once, and only once
  • Splice information - All isoforms (and sequence variants) corresponding to a locus are grouped as one Gene, which reduces the usual confusion between proteins and records.
  • Site-specific modifications - Known modifications, such as phosphorylation, can be explicitly indicated. Considering these site-specific modifications is much cheaper than a search that attempts to discover new modifications.
  • Rich annotations - The format has places to store information such as accession numbers from sequence repositories, species name, etc.

    You can use the program BuildMS2DB.jar to generate an MS2DB file. As input, you will need:
  • One or more files in GFF3 format containing exon predictions
  • A FASTA file containing the sequences on which the exons are predicted

    For more details on using BuildMS2DB.jar (and MS2DBShuffler.jar for building a decoy database), please refer to the information on proteogenomics.
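
    Before building an MS2DB file, it can help to confirm that the two inputs agree, that is, that every sequence named in the GFF3 file is present in the FASTA file. The following sketch is not part of Inspect. It relies on the standard GFF3 layout (nine tab-separated columns, with the sequence ID in the first column and the feature type in the third) and makes two assumptions: that the relevant feature types are "exon" and "CDS", and that a FASTA sequence ID is the first whitespace-delimited word of the header line.

```python
def gff3_seqids(lines, feature_types=("exon", "CDS")):
    """Collect the sequence IDs referenced by the selected GFF3 feature types."""
    seqids = set()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            break  # an embedded FASTA section follows; no more features
        if not line or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        if columns[2] in feature_types:
            seqids.add(columns[0])
    return seqids


def fasta_ids(lines):
    """Collect sequence IDs from FASTA header lines (first word after '>')."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            words = line[1:].split()
            if words:
                ids.add(words[0])
    return ids


def missing_sequences(gff3_lines, fasta_lines):
    """Sequence IDs used in the GFF3 file but absent from the FASTA file."""
    return sorted(gff3_seqids(gff3_lines) - fasta_ids(fasta_lines))
```

    Both functions accept any iterable of lines, so open file handles can be passed directly; an empty result from missing_sequences means every referenced sequence was found.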