./dripDigest.py [options] --fasta <protein database>
DripDigest digests a protein database (FASTA file) efficiently given a fixed memory budget. The resulting peptide database is output in a compact binary format utilized by dripSearch of the DRIP toolkit. If you use dripDigest in your research, please cite:
John T. Halloran, Jeff A. Bilmes, and William S. Noble. "Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry". Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014). AUAI, Quebec City, Quebec, Canada, July 2014.
--digest-dir <string>
– directory to write the
resulting peptide database to. Default
= dripDigest-output
.
<protein database>
– FASTA file.
The program writes files to the folder dripDigest-output
by
default. The name of the output folder can be set by the user using
the --digest-dir
option. The following files will be
created:
target.txt
–
a tab-delimited text file containing the digested peptides
if --peptide-list
is set to true.
decoy.txt
–
a tab-delimited text file containing the shuffled digested
peptides (decoys) if --peptide-list
and --decoys
are set to true.
targets.bin
–
binary file containing digested peptides.
decoys.bin
–
binary file containing permuted target peptides. This file will
only be created if the --decoys
option is set to
True.
--decoys <T|F>
–
whether to create (shuffle target peptides) and search decoy
peptides. Default = T
.
--recalibrate <T|F>
– whether to create a second
set of decoys (shuffling only). Default = False
--peptide-buffer <int>
–
The maximum number of peptides to keep in memory. Default
= 100000
.
--max-length <integer>
– The maximum length of peptides to consider. Default
= 50
.
--max-mass <float>
–
The maximum mass (in Da) of peptides to consider. Default
= 7200
.
--min-length <integer>
–
The minimum length of peptides to consider. Default
= 6
.
--min-mass <float>
–
The minimum mass (in Da) of peptides to consider. Default
= 200
.
--monoisotopic-precursor <T|F>
–
When computing the mass of a peptide, use monoisotopic masses
rather than average masses. Default = true
.
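As an illustration of the difference between monoisotopic and average masses, and of the --min-mass/--max-mass window, here is a minimal sketch. It is not dripDigest's code: the function names are hypothetical, the residue masses are standard reference values for the five amino acids of the peptide "EAMPK" only, and the inclusive bounds are an assumption.

```python
# Residue masses in Da for the amino acids of "EAMPK" only
# (standard reference values; not read from dripDigest).
MONO = {"E": 129.04259, "A": 71.03711, "M": 131.04049,
        "P": 97.05276, "K": 128.09496}
AVG = {"E": 129.1155, "A": 71.0788, "M": 131.1926,
       "P": 97.1167, "K": 128.1741}
WATER_MONO = 18.01056  # one water is added for the free termini
WATER_AVG = 18.01528


def peptide_mass(peptide, monoisotopic=True):
    """Sum the residue masses and add one water (hypothetical helper)."""
    table = MONO if monoisotopic else AVG
    water = WATER_MONO if monoisotopic else WATER_AVG
    return sum(table[aa] for aa in peptide) + water


def within_mass_window(mass, min_mass=200.0, max_mass=7200.0):
    """Apply the documented default mass limits; inclusive bounds are assumed."""
    return min_mass <= mass <= max_mass


mono = peptide_mass("EAMPK", monoisotopic=True)
avg = peptide_mass("EAMPK", monoisotopic=False)
print(round(mono, 4), round(avg, 4), within_mass_window(mono))
```

The average mass comes out slightly larger than the monoisotopic mass because it weights in the heavier isotopes.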
--mods-spec <string>
–
The general form of a modification specification has three
components, as exemplified by 1STY+79.966331.C+57.02146
.
--cterm-peptide-mods-spec <string>
–
Specify peptide C-terminal modifications. See
--nterm-peptide-mods-spec for syntax. Default
= <empty>
.
--nterm-peptide-mods-spec <string>
–
Specify peptide N-terminal modifications. Like --mods-spec, this
specification has three components, but with a slightly different
syntax. The max_per_peptide can
be either "1", in which case it defines a variable terminal
modification, or missing, in which case the modification is
static. The residues field
indicates which amino acids are subject to the modification, with
the residue X corresponding to
any amino acid. Finally, added_mass is defined as before. Default
= <empty>
.
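The three components described above (max_per_peptide, residues, added_mass) can be pulled out of a single specification such as 1STY+79.966331 with a small parser. This is only a sketch under stated assumptions: parse_mod and its regular expression are hypothetical, a negative mass change is assumed to be allowed, and how several specifications are joined in one option value is not handled here.

```python
import re

# Hypothetical pattern: optional count, one or more residue letters,
# then a signed mass. The real parser may accept other forms.
MOD_RE = re.compile(r"^(\d*)([A-Z]+)([+-]\d+(?:\.\d+)?)$")


def parse_mod(spec):
    """Split one modification spec into (max_per_peptide, residues, added_mass).

    max_per_peptide is None when the count is missing, which the text
    above describes as a static modification.
    """
    match = MOD_RE.match(spec)
    if match is None:
        raise ValueError("unrecognized modification spec: %r" % spec)
    count, residues, mass = match.groups()
    max_per_peptide = int(count) if count else None
    return max_per_peptide, residues, float(mass)


print(parse_mod("1STY+79.966331"))  # variable modification on S, T, or Y
print(parse_mod("C+57.02146"))      # no count given: static modification
```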
--max-mods <integer>
–
The maximum number of modifications that can be applied to a
single peptide. Default = 255
.
--min-mods <integer>
–
The minimum number of modifications that can be applied to a
single peptide. Default = 0
.
--decoy-format
<shuffle|peptide-reverse>
–
Include a decoy version of every peptide by shuffling or reversing
the target sequence or protein. In shuffle or peptide-reverse mode,
each peptide is either reversed or shuffled, leaving the N-terminal
and C-terminal amino acids in place. Note that peptides that appear
multiple times in the target database are only shuffled once. In
peptide-reverse mode, palindromic peptides are shuffled. Also, if a
shuffled peptide produces an overlap with the target or decoy
database, then the peptide is re-shuffled up to 5 times. Note that,
despite this repeated shuffling, homopolymers will appear in both
the target and decoy database. The protein-reverse mode reverses the
entire protein sequence, irrespective of the composite
peptides. Default = shuffle
.
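The shuffle-and-retry behavior described above can be sketched as follows. This is an illustration, not dripDigest's implementation: shuffle_decoy and its parameters are hypothetical, the terminal residues are kept in place, and "overlap" is taken to mean an exact match with a sequence already in the target or decoy set.

```python
import random


def shuffle_decoy(peptide, seen, rng, max_reshuffles=5):
    """Shuffle the interior of a peptide, keeping both terminal residues.

    `seen` holds the target and decoy sequences generated so far. If the
    shuffled sequence collides with one of them, shuffle again, at most
    `max_reshuffles` more times.
    """
    if len(peptide) < 2:
        return peptide
    first, middle, last = peptide[0], list(peptide[1:-1]), peptide[-1]
    decoy = peptide
    for _ in range(1 + max_reshuffles):
        rng.shuffle(middle)
        decoy = first + "".join(middle) + last
        if decoy not in seen:
            break
    return decoy


rng = random.Random(1)  # fixed seed, mirroring the --seed option
targets = {"EAMPK", "AAAAAA"}
print(shuffle_decoy("EAMPK", targets, rng))
print(shuffle_decoy("AAAAAA", targets, rng))  # a homopolymer cannot change
```

The homopolymer case shows why such peptides end up in both databases: every permutation equals the target, so the retries are exhausted without finding a new sequence.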
--keep-terminal-aminos <N|C|NC|none>
–
When creating decoy peptides using decoy-format=shuffle or
decoy-format=peptide-reverse, this option specifies whether the
N-terminal and C-terminal amino acids are kept in place or allowed
to be shuffled or reversed. For a target peptide "EAMPK" with
decoy-format=peptide-reverse, setting keep-terminal-aminos to "NC"
will yield "EPMAK"; setting it to "C" will yield "PMAEK"; setting
it to "N" will yield "EKPMA"; and setting it to "none" will yield
"KPMAE". Default = NC
.
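The "EAMPK" examples above can be reproduced with a few lines of Python. The function reverse_decoy is a hypothetical name used for illustration; the sketch covers peptide-reverse only and omits the palindrome handling described under --decoy-format.

```python
def reverse_decoy(peptide, keep="NC"):
    """Reverse a peptide, optionally holding the N- and/or C-terminal residue.

    `keep` takes the same values as --keep-terminal-aminos:
    "N", "C", "NC", or "none".
    """
    start = 1 if keep in ("N", "NC") else 0
    end = len(peptide) - 1 if keep in ("C", "NC") else len(peptide)
    # Only the slice between the kept termini is reversed.
    return peptide[:start] + peptide[start:end][::-1] + peptide[end:]


for setting in ("NC", "C", "N", "none"):
    print(setting, reverse_decoy("EAMPK", setting))
```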
--seed <string>
–
When given an unsigned integer value, seeds the random number
generator with that value. When given the string "time", seeds the
random number generator with the system time. Default
= 1
.
--enzyme
<no-enzyme|trypsin|trypsin/p|chymotrypsin|elastase|clostripain|cyanogen-bromide|iodosobenzoate|proline-endopeptidase|staph-protease|asp-n|lys-c|lys-n|arg-c|glu-c|pepsin-a|elastase-trypsin-chymotrypsin|custom-enzyme>
– Specify the enzyme used to digest the proteins in
silico. Available enzymes (with the corresponding digestion rules
indicated in parentheses) include no-enzyme ([X]|[X]), trypsin
([RK]|{P}), trypsin/p ([RK]|[]), chymotrypsin ([FWYL]|{P}), elastase
([ALIV]|{P}), clostripain ([R]|[]), cyanogen-bromide ([M]|[]),
iodosobenzoate ([W]|[]), proline-endopeptidase ([P]|[]),
staph-protease ([E]|[]), asp-n ([]|[D]), lys-c ([K]|{P}), lys-n
([]|[K]), arg-c ([R]|{P}), glu-c ([DE]|{P}), pepsin-a ([FL]|{P}),
elastase-trypsin-chymotrypsin ([ALIVKRWFY]|{P}). Specifying --enzyme
no-enzyme yields a non-enzymatic digest. Warning:
the resulting peptide database may be quite large. Default
= trypsin
.
--custom-enzyme <string>
– Specify rules for in silico digestion of protein
sequences. Overrides the enzyme option. Two lists of residues are
given enclosed in square brackets or curly braces and separated by
a |. The first list contains residues required/prohibited before
the cleavage site and the second list contains residues
required/prohibited after the cleavage site. If the residues are
required for digestion, they are enclosed in square brackets, '['
and ']'. If the residues prevent digestion, they are enclosed in
curly braces, '{' and '}'. Use X to indicate all residues. For
example, trypsin cuts after R or K but not before P, which is
represented as [RK]|{P}. AspN cuts after any residue but only
before D, which is represented as [X]|[D]. Default = <empty>
.
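To make the rule syntax concrete, the sketch below parses a rule such as [RK]|{P} and cuts a sequence at every matching site. It is not dripDigest's parser: all function names are hypothetical, and treating an empty list "[]" as "no constraint" is an assumption based on the built-in rules listed under --enzyme.

```python
def _parse_side(side):
    """Return (required, residues) for one side of a rule."""
    if side.startswith("[") and side.endswith("]"):
        return True, set(side[1:-1])
    if side.startswith("{") and side.endswith("}"):
        return False, set(side[1:-1])
    raise ValueError("bad rule component: %r" % side)


def _matches(residue, side):
    required, residues = side
    if required:
        # Assumption: "[]" and "[X]" both accept any residue.
        return not residues or "X" in residues or residue in residues
    return "X" not in residues and residue not in residues


def cleavage_sites(protein, rule):
    """Indices i such that the enzyme cuts between protein[i-1] and protein[i]."""
    before_text, after_text = rule.split("|")
    before, after = _parse_side(before_text), _parse_side(after_text)
    return [i for i in range(1, len(protein))
            if _matches(protein[i - 1], before) and _matches(protein[i], after)]


def digest(protein, rule):
    """Fully digest a sequence with no missed cleavages."""
    cuts = [0] + cleavage_sites(protein, rule) + [len(protein)]
    return [protein[a:b] for a, b in zip(cuts, cuts[1:])]


print(digest("AKRPGKDE", "[RK]|{P}"))  # trypsin rule: no cut between R and P
print(digest("AKDGD", "[X]|[D]"))      # AspN rule: cut before every D
```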
--digestion
<full-digest|partial-digest>
– Specify whether every peptide in the database must have two
enzymatic termini (full-digest) or if peptides with only one
enzymatic terminus are also included (partial-digest). Default
= full-digest
.
--missed-cleavages <integer>
– Maximum number of missed cleavages per peptide to allow in
enzymatic digestion. Default = 0
.
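One way to picture missed cleavages: if a protein has already been cut at every enzymatic site, a peptide with k missed cleavages is the concatenation of k+1 consecutive pieces. The sketch below enumerates these under that assumption. The function name is hypothetical, and the length and mass filters described earlier are not applied.

```python
def with_missed_cleavages(pieces, missed_cleavages=0):
    """Join runs of consecutive fully digested pieces.

    `pieces` is the ordered list of fragments from a complete digest;
    each result spans at most `missed_cleavages` uncut sites.
    """
    peptides = []
    for start in range(len(pieces)):
        for count in range(1, missed_cleavages + 2):
            if start + count > len(pieces):
                break
            peptides.append("".join(pieces[start:start + count]))
    return peptides


pieces = ["AK", "RPGK", "DE"]  # a tryptic digest of "AKRPGKDE"
print(with_missed_cleavages(pieces, 0))
print(with_missed_cleavages(pieces, 1))
```

With the default of 0, only the fully cleaved pieces are returned; each extra allowed missed cleavage adds the longer overlapping peptides.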