Pages: |
Sequence information input files

SLiMDisc currently takes protein sequences in two formats: FASTA and UniProt. UniProt long-format files should follow the guidelines described here. FASTA files should, if possible, follow the header description guidelines used by the UniProt database or, alternatively, one of the header styles described here. For full functionality, users are advised to enter proteins in these formats. Other formats are accepted, but certain visualisations may not be produced and other anomalies may occur.

NB. SLiMDisc is optimised for, and requires, at least three non-homologous sequences for analysis. (Analysis of two sequences will be added at a future date.)

The following will cause the SLiMDisc server to reject an input dataset. The server will notify the user of the reason for the rejection of the dataset.
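Before submitting a FASTA dataset, the basic requirements can be checked locally. The following is a minimal Python sketch, not the server's own validation: the function names `read_fasta` and `check_dataset` are hypothetical, the first word of each ">" header line is treated as the sequence identifier (which this guide asks to be unique), and homology between sequences is not assessed.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            # Assumption: the first word of the header is the identifier.
            words = line[1:].split()
            name = words[0] if words else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_dataset(records, minimum=3):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    if len(records) < minimum:
        problems.append(
            f"only {len(records)} sequence(s); at least {minimum} are needed"
        )
    names = [name for name, _ in records]
    duplicates = sorted({n for n in names if names.count(n) > 1})
    if duplicates:
        problems.append("duplicate identifiers: " + ", ".join(duplicates))
    return problems
```

Calling `check_dataset(read_fasta(open("input.fasta").read()))` on a hypothetical file `input.fasta` returns an empty list when the file holds at least three records with distinct identifiers.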
Fasta format details

SLiMDisc is quite flexible about the precise details of the FASTA file used for sequence input. However, to get maximum utility and to allow differentiation of sequences and source databases, downloads from UniProt are recommended. The basic requirement for FASTA sequences is that each description is on a single line that starts with ">" and is followed by one or more lines containing the actual sequence. The first word in each description should be unique. For example:

Most databases, homology search results etc. can be downloaded in FASTA format.

UniProt Format

The UniProt format is a highly complex format described in detail at: http://ca.expasy.org/sprot/userman.html

The format contains one or more protein entries, each terminated by a "//" end-of-entry delimiter. Each entry is made up of several lines, each structured in a defined way but always beginning with a two-character line code. Currently SLiMDisc uses only the "Identification", "Sequence data" and "Feature Table data" fields from this format. "Identification" contains the protein name, which is used to identify the protein throughout the process. "Sequence data" contains the amino acid sequence of the protein, which is used in various parts of the analysis. The "Feature Table data" is integral to masking and is discussed in detail in the masking section.

Feature Table field

The Feature Table field states the positions of regions of interest in the protein and gives information about them. There are a large number of defined keywords to describe these features; the features used for masking by SLiMDisc are given in the table below.

Code Description
For example:

Any one of these key names can be specified as a region to be masked by the SLiMDisc program. For example, all features with the key name "TRANSMEM" can be removed from the dataset, using the masking feature of SLiMDisc, before the sequences are searched for motifs. |
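The idea of feature-based masking can be illustrated with a short Python sketch. This is not SLiMDisc's own code: the function names `parse_uniprot` and `mask_features` are hypothetical, the parser assumes the older fixed-column feature line layout ("FT", then key, start and end positions separated by spaces), and replacing masked residues with "X" is an assumption made for illustration.

```python
def parse_uniprot(text):
    """Split UniProt flat-file text into entries (name, features, sequence)."""
    entries = []
    entry = {"name": None, "features": [], "sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        code = line[:2]  # every line starts with a two-character line code
        fields = line.split()
        if code == "//":
            # End-of-entry delimiter: store the entry and start a new one.
            if entry["name"] is not None:
                entries.append(entry)
            entry = {"name": None, "features": [], "sequence": ""}
            in_sequence = False
        elif code == "ID" and len(fields) > 1:
            entry["name"] = fields[1]
        elif code == "FT" and len(line) > 5 and line[5] != " ":
            # Assumed layout: "FT   KEY   FROM   TO   description".
            # Continuation lines leave the key column blank and are skipped.
            if len(fields) >= 4 and fields[2].isdigit() and fields[3].isdigit():
                entry["features"].append(
                    (fields[1], int(fields[2]), int(fields[3]))
                )
        elif code == "SQ":
            in_sequence = True
        elif in_sequence and code == "  ":
            # Sequence lines are indented and split into blocks by spaces.
            entry["sequence"] += "".join(fields)
    return entries


def mask_features(entry, keys, mask_char="X"):
    """Return the sequence with residues of the chosen feature keys masked."""
    residues = list(entry["sequence"])
    for key, start, end in entry["features"]:
        if key in keys:
            # Feature positions are 1-based and inclusive.
            for i in range(max(start, 1) - 1, min(end, len(residues))):
                residues[i] = mask_char
    return "".join(residues)
```

For an entry parsed this way, `mask_features(entry, {"TRANSMEM"})` returns the sequence with every annotated transmembrane region replaced by "X", leaving all other residues unchanged.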