SLiMDisc Help Pages

Sequence information input files

SLiMDisc currently takes protein sequences in two formats: FASTA and UniProt.

UniProt long format files should follow the guidelines described here.

FASTA files should, if possible, follow the header description guidelines used by the UniProt database or, alternatively, one of the header styles described here.

For full functionality, users are advised to enter proteins in these formats. Other formats are accepted, but certain visualisations may not be produced and other anomalies may occur.

NB. SLiMDisc is optimised for, and requires, at least three non-homologous sequences for analysis. (Analysis for two sequences will be added at a future date.)


The following will cause the SLiMDisc server to reject an input dataset.
The server will notify the user of the reason for the rejection of the dataset.

  • If a FASTA protein header does not conform to a header style described here.
  • If a character other than [A-Za-z0-9._-] is used in a protein name.
  • If a protein sequence is shorter than 8 amino acids.
  • If there are fewer than 3 non-homologous proteins in the dataset.
Examples of acceptable input formats are available here:
UniProt format
FASTA format
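
The rejection rules listed above can be pre-checked locally before a dataset is submitted. The following is a minimal Python sketch, not the server's own code: the function name check_dataset and the {name: sequence} input are hypothetical, and homology between sequences is not tested, so the final check only counts entries.

```python
import re

# Characters allowed in a protein name, as listed in the rejection rules.
NAME_PATTERN = re.compile(r"[A-Za-z0-9._-]+")
MIN_LENGTH = 8     # minimum sequence length in amino acids
MIN_PROTEINS = 3   # minimum number of proteins in a dataset


def check_dataset(proteins):
    """Return a list of problems found in a dict of {name: sequence}.

    Assumption: homology is NOT assessed here, so the dataset-size
    check counts all entries rather than non-homologous ones.
    """
    problems = []
    for name, seq in proteins.items():
        if not NAME_PATTERN.fullmatch(name):
            problems.append(f"{name}: invalid character in protein name")
        if len(seq) < MIN_LENGTH:
            problems.append(f"{name}: sequence shorter than {MIN_LENGTH} amino acids")
    if len(proteins) < MIN_PROTEINS:
        problems.append(f"dataset has fewer than {MIN_PROTEINS} proteins")
    return problems
```

An empty list means none of these three checks failed; the server may still reject the dataset for other reasons, such as header style or homology.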

FASTA format details

SLiMDisc is quite flexible about the precise details of the FASTA format file used for sequence input. However, to get maximum utility and allow differentiation of sequences and source databases, it is recommended to use downloads from UniProt.

The basic requirement for FASTA sequences is that each description should be on a single line that starts with ">" and is followed by one or more lines containing the actual sequence. The first word in each description should be unique.

For example.


Most databases and homology search results etc. can be downloaded in FASTA format.
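
The basic FASTA requirements described above (a ">" description line, one or more sequence lines, a unique first word) can be illustrated with a small reader. This is a sketch under those assumptions only; the function name parse_fasta is hypothetical and the code does not attempt to interpret any particular database header style.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {first word of description: sequence}.

    Raises ValueError if a description is empty, if the first word of a
    description is repeated, or if sequence data precedes any description.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("empty description line")
            name = words[0]
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before first '>' line")
        else:
            records[name].append(line)  # sequence may span several lines
    return {n: "".join(parts) for n, parts in records.items()}
```

Keying the result on the first word of the description is what makes the uniqueness requirement matter: two records sharing a first word could not be told apart downstream.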


UniProt Format

The UniProt format is a highly complex format described in detail at:

http://ca.expasy.org/sprot/userman.html

The format contains one or more protein entries, each terminated by a "//" end-of-entry delimiter. Each entry is made up of several lines, each structured in a defined way and beginning with a two-character line code.
Currently SLiMDisc uses only the "Identification", "Sequence data" and "Feature Table data" fields from this format. "Identification" contains the protein name, which is used to identify the protein throughout the process. "Sequence data" contains the amino acid sequence of the protein, which is used for various parts of the analysis. "Feature Table data" is used for masking and is discussed in detail below.
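
The entry structure described above can be sketched in Python as follows. This is an illustrative reader, not SLiMDisc's own parser: the function names are hypothetical, and it relies on the UniProt flat-file conventions that the entry name is the first word of the "ID" line and that sequence data lines carry a blank (two-space) line code after the "SQ" header.

```python
def split_entries(text):
    """Split UniProt flat-file text into entries.

    Each entry is returned as a list of (line_code, content) tuples.
    """
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = []
        elif line.strip():
            # The first two characters are the line code; the rest is content.
            current.append((line[:2], line[2:].strip()))
    if current:
        entries.append(current)  # tolerate a missing final "//"
    return entries


def entry_id(entry):
    """Return the entry name (first word of the ID line), or None."""
    for code, content in entry:
        if code == "ID":
            return content.split()[0]
    return None


def entry_sequence(entry):
    """Join the sequence data lines, which have a blank line code."""
    return "".join(content.replace(" ", "")
                   for code, content in entry if code == "  ")
```

Feature Table ("FT") lines are kept as raw (code, content) tuples here, because the exact layout of their position columns depends on the UniProt release.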

Feature Table field

The Feature Table field records the positions of regions of interest in a protein, along with information about them. There are a large number of defined keywords to describe these features; those used for masking by SLiMDisc are given in the table below.

Code        Description
SIGNAL      Extent of a signal sequence (prepeptide).
PROPEP      Extent of a propeptide.
TOPO_DOM    Topological domain.
TRANSMEM    Extent of a transmembrane region.
DOMAIN      Extent of a domain, which is defined as a specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.
REPEAT      Extent of an internal sequence repetition.
REGION      Extent of a region of interest in the sequence.
COILED      Extent of a coiled-coil region.

For example:



Any of these key names can be specified as a region to be masked by SLiMDisc. For example, all features with the key name "TRANSMEM" can be removed from the dataset before the sequences are searched for motifs, using the masking feature of SLiMDisc.
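
The idea of masking by feature key can be sketched as below. This is an assumption-laden illustration rather than SLiMDisc's implementation: the function mask_features is hypothetical, features are supplied as already-parsed (key, start, end) tuples with 1-based inclusive positions, and masked residues are replaced with "X", which is only one possible way of removing them from the motif search.

```python
# Feature keys listed in the table above as available for masking.
MASKABLE_KEYS = {"SIGNAL", "PROPEP", "TOPO_DOM", "TRANSMEM",
                 "DOMAIN", "REPEAT", "REGION", "COILED"}


def mask_features(sequence, features, keys, mask_char="X"):
    """Replace residues covered by the selected feature keys with mask_char.

    features: iterable of (key, start, end), 1-based inclusive positions.
    keys: the set of feature keys to mask, e.g. {"TRANSMEM"}.
    Assumption: replacing with "X" stands in for whatever SLiMDisc does.
    """
    residues = list(sequence)
    for key, start, end in features:
        if key in keys:
            # Clamp to the sequence bounds and convert to 0-based indexing.
            for i in range(max(start, 1) - 1, min(end, len(residues))):
                residues[i] = mask_char
    return "".join(residues)
```

For instance, calling mask_features with keys={"TRANSMEM"} leaves DOMAIN and other features untouched, so each key can be switched on or off independently.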