./dripDigest.py [options] --fasta <protein database>
DripDigest digests a protein database (FASTA file) efficiently given a fixed memory budget. The resulting peptide database is output in a compact binary format utilized by dripSearch of the DRIP toolkit. If you use dripDigest in your research, please cite:
John T. Halloran, Jeff A. Bilmes, and William S. Noble. "Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry". Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014). AUAI, Quebec City, Quebec, Canada, July 2014.
--digest-dir <string> – directory to write the
resulting peptide database to. Default
= dripDigest-output.
<protein database> – FASTA file.
The program writes files to the folder dripDigest-output by
default. The name of the output folder can be set using
the --digest-dir option. The following files will be
created:
target.txt –
a tab-delimited text file containing the digested peptides
if --peptide-list is set to true.
decoy.txt –
a tab-delimited text file containing the shuffled digested
peptides (decoys) if --peptide-list
and --decoys are set to true.
targets.bin –
binary file containing digested peptides.
decoys.bin –
binary file containing permuted target peptides. This file will
only be created if the --decoys option is set to
True.
--decoys <T|F> –
Whether to create decoy peptides (by shuffling target peptides)
and search them. Default = T.
--recalibrate <T|F> – Whether to create a second
set of decoys (shuffling only). Default = False.
--peptide-buffer <int> –
The maximum number of peptides to keep in memory. Default
= 100000.
--max-length <integer>
– The maximum length of peptides to consider. Default
= 50.
--max-mass <float> –
The maximum mass (in Da) of peptides to consider. Default
= 7200.
--min-length <integer> –
The minimum length of peptides to consider. Default
= 6.
--min-mass <float> –
The minimum mass (in Da) of peptides to consider. Default
= 200.
--monoisotopic-precursor <T|F> –
When computing the mass of a peptide, use monoisotopic masses
rather than average masses. Default = true.
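The length and mass limits above can be pictured as a simple filter over candidate peptides. The following is a minimal sketch, not dripDigest's actual code: the function names are hypothetical, the residue masses are the standard monoisotopic values, and it assumes that a peptide's mass is the sum of its residue masses plus one water and that all limits are inclusive.

```python
# Standard monoisotopic residue masses in Da.
MONO = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047679, "C": 103.009185, "L": 113.084064,
    "I": 113.084064, "N": 114.042927, "D": 115.026943, "Q": 128.058578,
    "K": 128.094963, "E": 129.042593, "M": 131.040485, "H": 137.058912,
    "F": 147.068414, "R": 156.101111, "Y": 163.063329, "W": 186.079313,
}
WATER = 18.010565  # monoisotopic mass of H2O in Da


def monoisotopic_mass(peptide):
    """Sum of residue masses plus one water (assumed neutral peptide mass)."""
    return sum(MONO[aa] for aa in peptide) + WATER


def within_limits(peptide, min_length=6, max_length=50,
                  min_mass=200.0, max_mass=7200.0):
    """Apply the length and mass limits; inclusive bounds are an assumption."""
    if not (min_length <= len(peptide) <= max_length):
        return False
    return min_mass <= monoisotopic_mass(peptide) <= max_mass
```

With the documented defaults, a five-residue peptide such as EAMPK is rejected on length alone, before its mass is computed.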
--mods-spec <string> –
The general form of a modification specification has three
components (max_per_peptide, residues, and added_mass), as
exemplified by 1STY+79.966331.C+57.02146.
--cterm-peptide-mods-spec <string> –
Specify peptide C-terminal modifications. See
--nterm-peptide-mods-spec for syntax. Default
= <empty>.
--nterm-peptide-mods-spec <string> –
Specify peptide N-terminal modifications. Like --mods-spec, this
specification has three components, but with a slightly different
syntax. The max_per_peptide can
be either "1", in which case it defines a variable terminal
modification, or missing, in which case the modification is
static. The residues field
indicates which amino acids are subject to the modification, with
the residue X corresponding to
any amino acid. Finally, added_mass is defined as before. Default
= <empty>.
--max-mods <integer> –
The maximum number of modifications that can be applied to a
single peptide. Default = 255.
--min-mods <integer> –
The minimum number of modifications that can be applied to a
single peptide. Default = 0.
--decoy-format
<shuffle|peptide-reverse>
–
Include a decoy version of every peptide by shuffling or reversing
the target sequence or protein. In shuffle or peptide-reverse mode,
each peptide is shuffled or reversed, respectively, leaving the
N-terminal and C-terminal amino acids in place. Note that peptides
that appear multiple times in the target database are only shuffled
once. In peptide-reverse mode, palindromic peptides are shuffled.
Also, if a shuffled peptide overlaps with the target or decoy
database, then the peptide is re-shuffled up to 5 times. Note that,
despite this repeated shuffling, homopolymers will appear in both
the target and decoy databases. The protein-reverse mode reverses the
entire protein sequence, irrespective of the constituent
peptides. Default = shuffle.
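The shuffle-and-retry behavior described for shuffle mode might look roughly like the sketch below. It is an illustration under assumptions, not the toolkit's implementation: shuffle_decoy and its parameters are hypothetical names, both termini are held fixed, and "re-shuffled up to 5 times" is read as one initial shuffle followed by at most five retries.

```python
import random


def shuffle_decoy(peptide, seen, rng, max_reshuffles=5):
    """Shuffle the interior residues, retrying while the result is in `seen`.

    `seen` is the set of known target and decoy sequences (an assumption
    about how overlap is detected).
    """
    if len(peptide) < 3:
        return peptide  # nothing to shuffle between fixed termini
    middle = list(peptide[1:-1])
    candidate = peptide
    for _ in range(1 + max_reshuffles):
        rng.shuffle(middle)
        candidate = peptide[0] + "".join(middle) + peptide[-1]
        if candidate not in seen:
            break
    return candidate


rng = random.Random(1)  # fixed seed for reproducible decoys
decoy = shuffle_decoy("PEPTIDEK", {"PEPTIDEK"}, rng)
```

A homopolymer such as AAAAAA comes back unchanged no matter how often it is shuffled, which is why such peptides end up in both databases.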
--keep-terminal-aminos <N|C|NC|none> –
When creating decoy peptides using decoy-format=shuffle or
decoy-format=peptide-reverse, this option specifies whether the
N-terminal and C-terminal amino acids are kept in place or allowed
to be shuffled or reversed. For a target peptide "EAMPK" with
decoy-format=peptide-reverse, setting keep-terminal-aminos to "NC"
will yield "EPMAK"; setting it to "C" will yield "PMAEK"; setting
it to "N" will yield "EKPMA"; and setting it to "none" will yield
"KPMAE". Default = NC.
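The four EAMPK cases above can be reproduced with a few lines of slicing. This is a minimal sketch with a hypothetical function name, intended only to make the semantics of the option concrete for peptide-reverse mode.

```python
def reverse_peptide(peptide, keep="NC"):
    """Reverse a peptide, holding the termini named in `keep` in place."""
    if keep not in ("N", "C", "NC", "none"):
        raise ValueError("unknown keep-terminal-aminos value: " + keep)
    start = 1 if keep in ("N", "NC") else 0
    end = len(peptide) - 1 if keep in ("C", "NC") else len(peptide)
    end = max(end, start)  # guard against peptides shorter than two residues
    return peptide[:start] + peptide[start:end][::-1] + peptide[end:]


for mode in ("NC", "C", "N", "none"):
    print(mode, reverse_peptide("EAMPK", mode))
```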
--seed <string> –
When given an unsigned integer value, seeds the random number
generator with that value. When given the string "time", seeds the
random number generator with the system time. Default
= 1.
--enzyme
<no-enzyme|trypsin|trypsin/p|chymotrypsin|elastase|clostripain|cyanogen-bromide|iodosobenzoate|proline-endopeptidase|staph-protease|asp-n|lys-c|lys-n|arg-c|glu-c|pepsin-a|elastase-trypsin-chymotrypsin|custom-enzyme>
– Specify the enzyme used to digest the proteins in
silico. Available enzymes (with the corresponding digestion rules
indicated in parentheses) include no-enzyme ([X]|[X]), trypsin
([RK]|{P}), trypsin/p ([RK]|[]), chymotrypsin ([FWYL]|{P}), elastase
([ALIV]|{P}), clostripain ([R]|[]), cyanogen-bromide ([M]|[]),
iodosobenzoate ([W]|[]), proline-endopeptidase ([P]|[]),
staph-protease ([E]|[]), asp-n ([]|[D]), lys-c ([K]|{P}), lys-n
([]|[K]), arg-c ([R]|{P}), glu-c ([DE]|{P}), pepsin-a ([FL]|{P}),
elastase-trypsin-chymotrypsin ([ALIVKRWFY]|{P}). Specifying --enzyme
no-enzyme yields a non-enzymatic digest. Warning:
the resulting peptide database may be quite large. Default
= trypsin.
--custom-enzyme <string>
– Specify rules for in silico digestion of protein
sequences. Overrides the enzyme option. Two lists of residues are
given enclosed in square brackets or curly braces and separated by
a |. The first list contains residues required/prohibited before
the cleavage site and the second list is residues after the
cleavage site. If the residues are required for digestion, they
are in square brackets, '[' and ']'. If the residues prevent
digestion, then they are enclosed in curly braces, '{' and
'}'. Use X to indicate all residues. For example, trypsin cuts
after R or K but not before P, which is represented as
[RK]|{P}. AspN cuts after any residue but only before D, which is
represented as [X]|[D]. Default = <empty>.
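To make the rule syntax concrete, the sketch below parses a rule and lists the positions at which a protein would be cleaved. The function names are hypothetical, and one point is an assumption rather than documented behavior: an empty bracket list such as [] is treated like [X], that is, as placing no constraint on that side, which matches how the built-in rules above appear to use it.

```python
import re


def parse_rule(rule):
    """Split a rule such as '[RK]|{P}' into two (residues, required) pairs."""
    sides = rule.split("|")
    if len(sides) != 2:
        raise ValueError("expected exactly one '|' in rule: " + rule)
    parsed = []
    for side in sides:
        match = re.fullmatch(r"\[([A-Z]*)\]|\{([A-Z]*)\}", side)
        if match is None:
            raise ValueError("malformed residue list: " + side)
        if match.group(1) is not None:
            parsed.append((match.group(1), True))   # square brackets: required
        else:
            parsed.append((match.group(2), False))  # curly braces: prohibited
    return parsed


def side_allows(residue, residues, required):
    """Check one side of a candidate cleavage site against its residue list."""
    matches = "X" in residues or residue in residues
    if required:
        return residues == "" or matches  # empty list assumed unconstrained
    return not matches


def cleavage_sites(protein, rule):
    """Return indices i such that the protein is cut between i-1 and i."""
    before, after = parse_rule(rule)
    return [i for i in range(1, len(protein))
            if side_allows(protein[i - 1], *before)
            and side_allows(protein[i], *after)]


print(cleavage_sites("MKPRAKDE", "[RK]|{P}"))  # trypsin: [4, 6]
```

In the example, the K at index 1 is followed by P, so the trypsin rule skips that site, while the trypsin/p rule [RK]|[] would also cut there.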
--digestion
<full-digest|partial-digest>
– Specify whether every peptide in the database must have two
enzymatic termini (full-digest) or if peptides with only one
enzymatic terminus are also included (partial-digest). Default
= full-digest.
--missed-cleavages <integer>
– Maximum number of missed cleavages per peptide to allow in
enzymatic digestion. Default = 0.
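The effect of --missed-cleavages under full-digest can be illustrated as follows. This is a hedged sketch with a hypothetical function name: it takes a precomputed list of cleavage positions and does not model partial-digest or the length and mass filters.

```python
def full_digest(protein, sites, missed_cleavages=0):
    """Enumerate fully enzymatic peptides with up to `missed_cleavages` skipped sites.

    `sites` holds indices i meaning the protein is cut between residues
    i-1 and i (a convention assumed for this sketch).
    """
    bounds = [0] + sorted(sites) + [len(protein)]
    last = len(bounds) - 1
    peptides = []
    for i in range(last):
        # j - i - 1 is the number of internal cleavage sites left uncut.
        for j in range(i + 1, min(i + 1 + missed_cleavages, last) + 1):
            peptides.append(protein[bounds[i]:bounds[j]])
    return peptides


print(full_digest("MKPRAKDE", [4, 6]))
# ['MKPR', 'AK', 'DE']
print(full_digest("MKPRAKDE", [4, 6], missed_cleavages=1))
# ['MKPR', 'MKPRAK', 'AK', 'AKDE', 'DE']
```

Each additional allowed missed cleavage adds longer peptides that span one more uncut site, so the peptide database grows with this setting.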