Inspect: A Proteomics Search Toolkit
Copyright 2007, The Regents of the University of California
Table of Contents
Overview
Copyright information
Installation
Database
Searching
Analysis
Basic Tutorial
Advanced Tutorial
Unrestricted Search Tutorial
Searching
To run a search, you first create an inspect input file. The input file is text file that
tells Inspect what to do. Each line of the
input file has the form [COMMAND],[VALUE]. For example, one line might be "spectra,spec18.dta",
where the command is "spectra" and the value is "spec18.dta". Inspect ignores blank lines. You can
include comments by starting lines with a hash character (#). Here is an example
of what an input file might look like:
spectra,Fraction01.mzxml
instrument,ESI-ION-TRAP
protease,Trypsin
DB,TestDatabase.trie
# Protecting group on cysteine:
mod,57,C,fix
Here are the
available input file commands. Those you are most likely to set are listed first. The only required
commands are one or more "spectra" commands, and either "db" or "SequenceFile". Commands are
case-insensitive (type "Spectra" or "spectra", it doesn't matter). Values are case-insensitive with
the exception (on Linux) of filenames. If Inspect doesn't understand a command, it will print a
warning and ignore it.
spectra,[FILENAME] - Specifies a spectrum file to search. You can specify the name of a
directory to search every file in that directory (non-recursively).
Preferred file formats: .mzXML and .mgf
Other accepted file formats: .mzData, .ms2 .dta.
Note that multiple spectra in a single .dta file are not supported.
db,[FILENAME] - Specifies the name of a database (.trie file) to search. The .trie file
contains one or more protein sequences delimited by asterisks, with no whitespace or other data.
Use PrepDB.py (see Databases to prepare a database. You should specify
at least one database. You may specify several databases; if so, each database will be searched in turn.
SequenceFile,[FILENAME] - Specifies the name of a FASTA-format protein database to search. If
you plan to search a large database, it is more efficient to preprocess it using PrepDB.py and use the "db"
command instead. You can specify at most one SequenceFile.
protease,[NAME] - Specifies the name of a protease. "Trypsin", "None", and "Chymotrypsin" are
the available values. If tryptic digest is specified, then matches with non-tryptic termini are penalized.
mod,[MASS],[RESIDUES],[TYPE],[NAME] - Specifies an amino acid modification. The delta mass
(in daltons) and affected amino acids are required. The first four characters of the name should be
unique. Valid values for "type" are "fix", "cterminal", "nterminal", and "opt" (the default). For a guide
to various known modification types, consult the following databases:
ABRF mass delta reference
UNIMOD database
RESID database of modifications
Examples:
mod,+57,C,fix - Most searches should include this line. It reflects the addition of CAM
(carbamidomethylation, done by adding iodoacetamide) which prevents cysteines from forming disulfide bonds.
mod,80,STY,opt,phosphorylation
mod,16,M (Oxidation of methionine, seen in many samples)
mod,43,*,nterminal (N-terminal carbamylation, common if sample is treated with urea)
Important note: When searching for phosphorylation sites, use a modification with the name "phosphorylation".
This lets Inspect know that it should use its model of phosphopeptide fragmentation
when generating tags and scoring matches. (Phosphorylation of serine dramatically affects fragmentation, so
modeling it as simply an 80Da offset is typically not sufficient to detect sites with high sensitivity)
Mods,[COUNT] - Number of PTMs permitted in a single peptide. Set this to 1 (or higher) if you
specify PTMs to search for.
Unrestrictive,[FLAG] - If FLAG is 1, use the MS-Alignment algorithm to perform an unrestrictive
search (allowing arbitrary modification masses). Running an unrestrictive search with one mod per peptide is slower than the
normal (tag-based) search; running time is approximately 1 second per spectrum per megabyte of database. Running an unrestrictive search
with two mods is significantly slower. We recommend performing unrestrictive searches against a small database, containing proteins
output by an earlier search. (The "Summary.py" script can be used to generate a second-pass database
from initial search results; see Analysis)
MaxPTMSize,[SIZE] - For blind search, specifies the maximum modification size (in Da) to consider.
Defaults to 250. Larger values require more time to search.
PMTolerance,[MASS] - Specifies the parent mass tolerance, in Daltons. A candidate's
flanking mass can differ from the tag's flanking mass by no more than ths amount. Default value
is 2.5. Note that secondary ions are often selected for fragmentation, so parent mass errors near
1.0Da or -1.0Da are not uncommon in typical datasets, even on FT machines.
ParentPPM,[MASS] - Specifies a parent mass tolerance, in parts per million. Alternative to PMTolerance.
IonTolerance,[MASS] - Error tolerance for how far ion fragments (b and y peaks) can be
shifted from their expected masses. Default is 0.5. Higher values produce a more sensitive but much slower search.
PeakPPM,[MASS] - Specifies a fragment mass tolerance, in parts per million. Alternative to IonTolerance.
MultiCharge,[FLAG] - If set to true, attempt to guess the precursor charge and mass, and consider
multiple charge states if feasible.
Instrument,[TYPE] - Options are ESI-ION-TRAP (default), QTOF, and FT-Hybrid. If set to ESI-ION-TRAP,
Inspect attempts to correct the parent mass. If set to QTOF, Inspect uses a fragmentation model trained on
QTOF data. (QTOF data typically features a stronger y ladder and weaker b ladder than other spectra).
RequiredMod,[NAME] - The specified modification MUST be found somewhere on the peptide.
TagCount,[COUNT] - Number of tags to generate
TagLength,[LENGTH] - Length of peptide sequence tags. Defaults to 3. Accepted values are 1 through 6.
RequireTermini,[COUNT] - If set to 1 or 2, require 1 or 2 valid proteolytic termini. Deprecated, because
the scoring model already incorporates the number of valid (tryptic) termini.
Non-standard options:
TagsOnly - Tags are generated and written to the specified output file. No search is performed.
Command-line arguments
Inspect features a few command-line options. Most options are specified in an input file, rather
than on the command-line. The
command-line options are:
-i Input file name. Defaults to "Input.txt"
-o Output file name. Defaults to "Inspect.txt"
-e Error file name. Defaults to "Inspect.err".
-r The resource directory. Defaults to the current working directory. The resource directory
is where Inspect searches for its resource files such as AminoAcidMasses.txt.
Sample usage:
On Windows: Inspect -i TripureIn.txt -o TripureOut.txt
On Linux: ./inspect -i TripureIn.txt -o TripureOut.txt
Error Reporting
If Inspect encounters a problem - such as a spectrum file with garbled format, or
running out of memory - it reports the problem to the error file. One error (or warning)
is reported per line of the file, and each error/warning type has an ID, to make them
easier to parse. If no error file is left behind after a run, then there were no errors - this
is a good thing!
Here is a sample error message, where I gave inspect an incorrect file name:
[E0008] .\ParseInput.c:725:Unable to open requested file '.\Database\TestDatbaase.trie'
And here is a sample warning message, where - on a small search - Inspect was not able to re-fit the p-value distribution:
{W0010} .\PValue.c:396:Few spectra were searched; not recalibrating the p-value curve.