dripDigest

Usage:

./dripDigest.py [options] --fasta <protein database>

Description:

dripDigest digests a protein database (FASTA file) efficiently within a fixed memory budget. The resulting peptide database is written in a compact binary format used by dripSearch of the DRIP toolkit. If you use dripDigest in your research, please cite:

John T. Halloran, Jeff A. Bilmes, and William S. Noble. "Learning Peptide-Spectrum Alignment Models for Tandem Mass Spectrometry". Thirtieth Conference on Uncertainty in Artificial Intelligence (UAI 2014). AUAI, Quebec City, Quebec, Canada, July 2014.

Input:

--digest-dir <string> – directory to write the resulting peptide database to. Default = dripDigest-output.
--fasta <protein database> – FASTA file.

Output:

The program writes files to the folder dripDigest-output by default. The name of the output folder can be set by the user using the --digest-dir option. The following files will be created:

target.txt – a tab-delimited text file containing the digested peptides if --peptide-list is set to true.
decoy.txt – a tab-delimited text file containing the shuffled digested peptides (decoys) if --peptide-list and --decoys are set to true.
targets.bin – binary file containing digested peptides.
decoys.bin – binary file containing permuted target peptides. This file will only be created if the --decoys option is set to True.

Options:

Program settings
- --decoys <T|F> – whether to create decoy peptides (by shuffling the target peptides) for use in the search. Default = T.
- --recalibrate <T|F> – whether to create a second set of decoys (shuffling only). Default = False.
- --peptide-buffer <int> – The maximum number of peptides to keep in memory. Default = 100000.
Peptide properties
- --max-length <integer> – The maximum length of peptides to consider. Default = 50.
- --max-mass <float> – The maximum mass (in Da) of peptides to consider. Default = 7200.
- --min-length <integer> – The minimum length of peptides to consider. Default = 6.
- --min-mass <float> – The minimum mass (in Da) of peptides to consider. Default = 200.
- --monoisotopic-precursor <T|F> – When computing the mass of a peptide, use monoisotopic masses rather than average masses. Default = true.
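The length and mass limits above act as a filter on candidate peptides. The following is a minimal sketch of such a filter, not dripDigest's actual code: the names peptide_mass and passes_filters are hypothetical, the residue masses are standard monoisotopic values rather than values read from dripDigest, and the sketch assumes that a peptide's mass is the sum of its residue masses plus one water. Modifications are ignored.

```python
# Standard monoisotopic residue masses in Da (textbook values,
# not taken from the dripDigest source).
MONO_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # monoisotopic mass of H2O in Da


def peptide_mass(peptide):
    """Unmodified monoisotopic mass (assumption: residues plus one water)."""
    return sum(MONO_RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


def passes_filters(peptide, min_length=6, max_length=50,
                   min_mass=200.0, max_mass=7200.0):
    """Hypothetical helper mirroring the documented default limits."""
    if not (min_length <= len(peptide) <= max_length):
        return False
    return min_mass <= peptide_mass(peptide) <= max_mass
```

With the defaults, passes_filters("EAMPK") is False because the peptide has only five residues, shorter than --min-length 6.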
Amino acid modifications
- --mods-spec <string> – The general form of a modification specification has three components, as exemplified by 1STY+79.966331.
  The three components are: [max_per_peptide]residues[+/-]mass_change
  In the example, max_per_peptide is 1, residues are STY, and mass_change is +79.966331. To specify a static modification, the number preceding the amino acid must be omitted; i.e., C+57.02146 specifies a static modification of 57.02146 Da to cysteine. Note that dripDigest allows at most one modification per amino acid. Also, the default modification (C+57.02146) will be added to every mods-spec string unless an explicit C+0 is included. Default = C+57.02146.
- --cterm-peptide-mods-spec <string> – Specify peptide C-terminal modifications. See --nterm-peptide-mods-spec for syntax. Default = <empty>.
- --nterm-peptide-mods-spec <string> – Specify peptide N-terminal modifications. Like --mods-spec, this specification has three components, but with a slightly different syntax. The max_per_peptide can be either "1", in which case it defines a variable terminal modification, or missing, in which case the modification is static. The residues field indicates which amino acids are subject to the modification, with the residue X corresponding to any amino acid. Finally, mass_change is defined as for --mods-spec. Default = <empty>.
- --max-mods <integer> – The maximum number of modifications that can be applied to a single peptide. Default = 255.
- --min-mods <integer> – The minimum number of modifications that can be applied to a single peptide. Default = 0.
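The three-component --mods-spec syntax can be illustrated with a small parser. This is a minimal sketch rather than dripDigest's own parsing code; the function name parse_mods_spec and the tuple layout it returns are assumptions made for illustration. It does not model the automatic addition of the default C+57.02146 modification.

```python
import re

# [max_per_peptide]residues[+/-]mass_change, e.g. 1STY+79.966331 or C+57.02146
_MOD_RE = re.compile(r"^(\d*)([A-Z]+)([+-]\d+(?:\.\d+)?)$")


def parse_mods_spec(spec):
    """Hypothetical helper: split a comma-separated mods-spec string into
    (max_per_peptide, residues, mass_change) tuples.

    max_per_peptide is None for a static modification (no leading number).
    """
    mods = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        match = _MOD_RE.match(item)
        if match is None:
            raise ValueError("malformed modification: %r" % item)
        count, residues, mass = match.groups()
        mods.append((int(count) if count else None, residues, float(mass)))
    return mods
```

For example, parse_mods_spec("1STY+79.966331") yields one variable modification, (1, "STY", 79.966331), while parse_mods_spec("C+57.02146") yields the static modification (None, "C", 57.02146).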
Decoy database generation
- --decoy-format <shuffle|peptide-reverse> – Include a decoy version of every peptide by shuffling or reversing the target sequence or protein. In shuffle or peptide-reverse mode, each peptide is shuffled or reversed, respectively, by default leaving the N-terminal and C-terminal amino acids in place (see --keep-terminal-aminos). Note that peptides that appear multiple times in the target database are only shuffled once. In peptide-reverse mode, palindromic peptides are shuffled. Also, if a shuffled peptide produces an overlap with the target or decoy database, then the peptide is re-shuffled up to 5 times. Note that, despite this repeated shuffling, homopolymers will appear in both the target and decoy database. The protein-reverse mode reverses the entire protein sequence, irrespective of the composite peptides. Default = shuffle.
- --keep-terminal-aminos <N|C|NC|none> – When creating decoy peptides using decoy-format=shuffle or decoy-format=peptide-reverse, this option specifies whether the N-terminal and C-terminal amino acids are kept in place or allowed to be shuffled or reversed. For a target peptide "EAMPK" with decoy-format=peptide-reverse, setting keep-terminal-aminos to "NC" will yield "EPMAK"; setting it to "C" will yield "PMAEK"; setting it to "N" will yield "EKPMA"; and setting it to "none" will yield "KPMAE". Default = NC.
- --seed <string> – When given an unsigned integer value, seeds the random number generator with that value. When given the string "time", seeds the random number generator with the system time. Default = 1.
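The effect of --keep-terminal-aminos on the "EAMPK" example can be reproduced with a short sketch. This is an illustration under stated assumptions, not dripDigest's implementation: the function make_decoy and its parameters are hypothetical, and the sketch omits the palindrome handling and the re-shuffling of decoys that collide with the target or decoy database.

```python
import random


def make_decoy(peptide, decoy_format="shuffle", keep_terminal_aminos="NC",
               rng=None):
    """Hypothetical helper: shuffle or reverse a peptide, optionally
    holding the N-terminal and/or C-terminal residue in place."""
    keep_n = keep_terminal_aminos in ("N", "NC")
    keep_c = keep_terminal_aminos in ("C", "NC")
    start = 1 if keep_n else 0
    end = len(peptide) - 1 if keep_c else len(peptide)
    if end <= start:
        # Nothing left to permute once the termini are fixed.
        return peptide
    middle = list(peptide[start:end])
    if decoy_format == "peptide-reverse":
        middle.reverse()
    elif decoy_format == "shuffle":
        (rng or random).shuffle(middle)
    else:
        raise ValueError("unknown decoy format: %r" % decoy_format)
    return peptide[:start] + "".join(middle) + peptide[end:]
```

Calling make_decoy("EAMPK", "peptide-reverse", "NC") returns "EPMAK", and the settings "C", "N", and "none" return "PMAEK", "EKPMA", and "KPMAE", matching the values listed for --keep-terminal-aminos. For reproducible shuffles, pass rng=random.Random(1), which plays a role analogous to --seed.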
Enzymatic digestion
- --enzyme <no-enzyme|trypsin|trypsin/p|chymotrypsin|elastase|clostripain|cyanogen-bromide|iodosobenzoate|proline-endopeptidase|staph-protease|asp-n|lys-c|lys-n|arg-c|glu-c|pepsin-a|elastase-trypsin-chymotrypsin|custom-enzyme> – Specify the enzyme used to digest the proteins in silico. Available enzymes (with the corresponding digestion rules indicated in parentheses) include no-enzyme ([X]|[X]), trypsin ([RK]|{P}), trypsin/p ([RK]|[]), chymotrypsin ([FWYL]|{P}), elastase ([ALIV]|{P}), clostripain ([R]|[]), cyanogen-bromide ([M]|[]), iodosobenzoate ([W]|[]), proline-endopeptidase ([P]|[]), staph-protease ([E]|[]), asp-n ([]|[D]), lys-c ([K]|{P}), lys-n ([]|[K]), arg-c ([R]|{P}), glu-c ([DE]|{P}), pepsin-a ([FL]|{P}), elastase-trypsin-chymotrypsin ([ALIVKRWFY]|{P}). Specifying --enzyme no-enzyme yields a non-enzymatic digest. Warning: the resulting peptide database may be quite large. Default = trypsin.
- --custom-enzyme <string> – Specify rules for in silico digestion of protein sequences. Overrides the enzyme option. Two lists of residues are given enclosed in square brackets or curly braces and separated by a |. The first list contains residues required/prohibited before the cleavage site and the second list is residues after the cleavage site. If the residues are required for digestion, they are in square brackets, '[' and ']'. If the residues prevent digestion, then they are enclosed in curly braces, '{' and '}'. Use X to indicate all residues. For example, trypsin cuts after R or K but not before P which is represented as [RK]|{P}. AspN cuts after any residue but only before D which is represented as [X]|[D]. Default = <empty>.
- --digestion <full-digest|partial-digest> – Specify whether every peptide in the database must have two enzymatic termini (full-digest) or if peptides with only one enzymatic terminus are also included (partial-digest). Default = full-digest.
- --missed-cleavages <integer> – Maximum number of missed cleavages per peptide to allow in enzymatic digestion. Default = 0.
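The cleavage-rule notation and the --missed-cleavages option can be sketched as follows. This is a simplified illustration, not the code dripDigest runs: all function names are hypothetical, only full digestion is modeled, and the sketch assumes that an empty required list such as [] places no restriction on the residue, in the same way as [X].

```python
def parse_enzyme_rule(rule):
    """Parse a rule such as '[RK]|{P}' into two (required, residues) pairs."""
    before, after = rule.split("|")

    def parse_side(side):
        if side.startswith("[") and side.endswith("]"):
            return True, side[1:-1]    # residues required for cleavage
        if side.startswith("{") and side.endswith("}"):
            return False, side[1:-1]   # residues that prevent cleavage
        raise ValueError("malformed rule component: %r" % side)

    return parse_side(before), parse_side(after)


def residue_allowed(residue, constraint):
    required, residues = constraint
    if required:
        # Assumption: an empty list or X places no restriction.
        return residues == "" or "X" in residues or residue in residues
    # Prohibited list: X prohibits every residue.
    return "X" not in residues and residue not in residues


def cleavage_sites(protein, rule):
    """Positions i such that the enzyme cuts between protein[i-1] and protein[i]."""
    before, after = parse_enzyme_rule(rule)
    return [i for i in range(1, len(protein))
            if residue_allowed(protein[i - 1], before)
            and residue_allowed(protein[i], after)]


def full_digest(protein, rule, missed_cleavages=0, min_length=6, max_length=50):
    """Fully enzymatic peptides spanning at most missed_cleavages internal sites."""
    bounds = [0] + cleavage_sites(protein, rule) + [len(protein)]
    peptides = []
    for i in range(len(bounds) - 1):
        last = min(i + 1 + missed_cleavages, len(bounds) - 1)
        for j in range(i + 1, last + 1):
            peptide = protein[bounds[i]:bounds[j]]
            if min_length <= len(peptide) <= max_length:
                peptides.append(peptide)
    return peptides
```

For the toy sequence "AAKPRGGKMR", full_digest(seq, "[RK]|{P}", min_length=2) gives ["AAKPR", "GGK", "MR"], because the K followed by P is not cleaved. With the trypsin/p rule "[RK]|[]" the same call gives ["AAK", "PR", "GGK", "MR"], and setting missed_cleavages=1 under the trypsin rule adds "AAKPRGGK" and "GGKMR".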

Example usage:

Partial digestion with two missed cleavages, using trypsin without proline suppression as the digesting enzyme


          python -OO dripDigest.py \
          --min-length 6 \
          --fasta data/yeast.fasta \
          --enzyme 'trypsin/p' \
          --monoisotopic-precursor true \
          --missed-cleavages 2 \
          --digestion 'partial-digest'

Variable modifications, TMT labeling static modifications, cleave at every K


          ./dripDigest.py \
          --fasta data/plasmo_Pfalciparum3D7_NCBI.fasta \
          --min-length 7 \
          --custom-enzyme '[K]|[X]' \
          --mods-spec '3M+15.9949,C+57.0214,K+229.16293' \
          --nterm-peptide-mods-spec 'X+229.16293' \
          --monoisotopic-precursor true \
          --decoys True

In ensuing search, take max PSM over charge-varying spectra


          ./dripDigest.py \
          --fasta data/plasmo_Pfalciparum3D7_NCBI.fasta \
          --min-length 7 \
          --custom-enzyme '[K]|[X]' \
          --mods-spec '3M+15.9949,C+57.0214,K+229.16293' \
          --nterm-peptide-mods-spec 'X+229.16293' \
          --monoisotopic-precursor true \
          --decoys True \
          --recalibrate True
