FASTA file format

Crux searches protein sequence databases given in FASTA format. The format is very simple. Every entry consists of a sequence identifier (ID), an optional comment (COMMENT), and a sequence (SEQUENCE). The format looks like this:
  >ID COMMENT
  SEQUENCE
The special character ">" marks the beginning of a new sequence. The ">" character is followed immediately by the sequence identifier. The rest of that line is occupied by the optional comment. Subsequent lines contain the sequence itself.
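As a rough illustration of the header layout described above, the following Python sketch splits a header line into its identifier and optional comment. The function name split_header is hypothetical and is not part of Crux. The sketch also assumes the identifier ends at the first white space, which the description above implies but does not state outright.

```python
def split_header(line):
    """Split a FASTA header line into (identifier, comment).

    Hypothetical helper for illustration; not part of Crux.
    """
    if not line.startswith(">"):
        raise ValueError("a FASTA header line must begin with '>'")
    # Assumption: the identifier runs from just after '>' to the first
    # white space, and anything after that is the optional comment.
    # This sketch is lenient and also tolerates spaces right after '>'.
    parts = line[1:].rstrip("\r\n").split(None, 1)
    seq_id = parts[0] if parts else ""
    comment = parts[1].strip() if len(parts) > 1 else ""
    return seq_id, comment
```

A header with no comment simply yields an empty string as the second element.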

Some rules about representing sequences:
A single protein sequence can span multiple lines. A ">" character at the beginning of a line starts the next entry and therefore marks the end of the current sequence.

Case doesn't matter. Crux Suite converts everything to uppercase.

White space (spaces and newlines) within the sequence is ignored.

Characters should be from the amino acid alphabet, which contains twenty characters for amino acids ("ACDEFGHIKLMNPQRSTVWY") and is augmented by four more ambiguous characters ("BUXZ"):
    A  alanine                         P  proline
    B  aspartate or asparagine         Q  glutamine
    C  cysteine                        R  arginine
    D  aspartate                       S  serine
    E  glutamate                       T  threonine
    F  phenylalanine                   U  any
    G  glycine                         V  valine
    H  histidine                       W  tryptophan
    I  isoleucine                      Y  tyrosine
    K  lysine                          Z  glutamate or glutamine
    L  leucine                         X  any
    M  methionine                      
    N  asparagine
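The rules above can be sketched as a small Python reader. This is a minimal illustration, not the parser Crux actually uses: the names read_fasta and invalid_characters are hypothetical, lines appearing before the first ">" are skipped by assumption, and how Crux itself treats characters outside the alphabet is not specified here, so the sketch only reports them.

```python
# The twenty amino acid characters plus the four ambiguous ones listed above.
ALPHABET = set("ACDEFGHIKLMNPQRSTVWY" "BUXZ")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines.

    Hypothetical helper for illustration; not part of Crux.
    """
    header, chunks = None, []
    for line in lines:
        if line.startswith(">"):
            # A '>' at the start of a line ends the previous sequence.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Drop all white space and convert to uppercase.
            chunks.append("".join(line.split()).upper())
        # Assumption: lines before the first header are skipped.
    if header is not None:
        yield header, "".join(chunks)


def invalid_characters(sequence):
    """Return the sorted characters of sequence that fall outside ALPHABET."""
    return sorted(set(sequence) - ALPHABET)
```

Because read_fasta accepts any iterable of lines, an open file object can be passed to it directly, for example `for header, seq in read_fasta(open(path)):` with a path of your choosing.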
Here is an example of three sequences in FASTA format.
