SLiMFinder Help Pages

Sequence information input files

NB. SLiMFinder is optimised for, and requires, at least three sequences for analysis. (Analysis for two sequences will be added at a future date.)

Examples of acceptable input formats are available here:
UniProt format
FASTA format


Two methods of entry are possible:

Raw protein sequences
SLiMFinder currently takes protein sequences in two formats: FASTA and UniProt long format. FASTA files should, if possible, follow the description-line guidelines used by the UniProt database. Other formats are accepted, but certain visualisations may not be produced and other anomalies may occur.

or

UniProt Ids
The right-hand panel of the inputs section allows a list of UniProt Ids to be entered. The corresponding entries will be fetched and used as the input data for SLiMFinder. For full functionality, users are advised to enter protein data in this way.

FASTA format details

SLiMFinder is quite flexible about the precise details of the FASTA file used for sequence input. However, to get maximum utility and to allow sequences and source databases to be distinguished, it is recommended to use downloads from UniProt.

The basic requirement for FASTA sequences is that each description occupies a single line beginning with ">", followed by one or more lines containing the actual sequence. The first word in each description should be unique.

For example.


Most databases and homology search results etc. can be downloaded in FASTA format.
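
The basic FASTA requirements described above (a ">" description line, one or more sequence lines, a unique first word, and at least three sequences) can be checked before submission with a short script. The following Python is only a sketch: the function names and the example sequences are hypothetical and are not part of SLiMFinder.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (description, sequence) pairs."""
    records = []
    description = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def check_input(records, minimum=3):
    """Return a list of problems; an empty list means the input looks usable."""
    problems = []
    if len(records) < minimum:
        problems.append(f"need at least {minimum} sequences, found {len(records)}")
    seen = set()
    for description, sequence in records:
        words = description.split()
        if not words:
            problems.append("description line has no identifier")
            continue
        first_word = words[0]
        if first_word in seen:
            problems.append(f"duplicate identifier: {first_word}")
        seen.add(first_word)
        if not sequence:
            problems.append(f"no sequence for {first_word}")
    return problems


# Hypothetical input: the third record reuses the first word of the second.
example = """>seqA_HUMAN hypothetical protein A
MKTLLVAGAV
LLPQRS
>seqB_HUMAN hypothetical protein B
MSDNELQK
>seqB_HUMAN duplicated identifier
MAAAK
"""

records = parse_fasta(example)
problems = check_input(records)
print(problems)
```

An empty list from check_input suggests the file meets the basic requirements; it does not guarantee that every visualisation will be produced.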


UniProt Format

The UniProt format is a highly complex format described in detail at:

http://ca.expasy.org/sprot/userman.html

The format contains one or more protein entries separated by a "//" end-of-entry delimiter. Each entry consists of several lines, each with a defined structure and always beginning with a two-character line code.
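
As an illustration of this layout, the Python sketch below splits flat-file text into entries at each "//" line and separates the two-character line code from the rest of each line. The function name and the sample entries are hypothetical, and the sample is heavily abbreviated compared with real UniProt records.

```python
def split_entries(text):
    """Split flat-file text into entries, each a list of (code, content) pairs."""
    entries = []
    current = []
    for line in text.splitlines():
        if line.startswith("//"):
            # End-of-entry delimiter: store the entry collected so far.
            if current:
                entries.append(current)
            current = []
        elif line.strip():
            # The first two characters are the line code. Sequence data
            # lines begin with spaces, so their code is two blanks.
            current.append((line[:2], line[2:].strip()))
    if current:
        # Tolerate a missing final delimiter.
        entries.append(current)
    return entries


# Hypothetical, abbreviated sample with two entries.
sample = "\n".join([
    "ID   TEST1_HUMAN   Reviewed;   12 AA.",
    "SQ   SEQUENCE   12 AA;",
    "     MKTLLVAGAV LL",
    "//",
    "ID   TEST2_HUMAN   Reviewed;   8 AA.",
    "SQ   SEQUENCE   8 AA;",
    "     MSDNELQK",
    "//",
])

entries = split_entries(sample)
for entry in entries:
    print([code for code, _ in entry])
```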
SLiMFinder currently uses only the "Identification", "Sequence data" and "Feature Table data" fields from this format. "Identification" contains the protein name, which is used to identify the protein throughout the process. "Sequence data" contains the amino acid sequence of the protein, which is used in various parts of the analysis. "Feature Table data" is central to masking and is therefore described in more detail below.

Feature Table field

The Feature Table field gives the position of, and information about, regions of interest in the protein. A large number of keywords are defined to describe these features; those used for masking by SLiMFinder are given in the table below.

Code      Description
SIGNAL    Extent of a signal sequence (prepeptide).
PROPEP    Extent of a propeptide.
TOPO_DOM  Topological domain.
TRANSMEM  Extent of a transmembrane region.
DOMAIN    Extent of a domain, which is defined as a specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.
REPEAT    Extent of an internal sequence repetition.
REGION    Extent of a region of interest in the sequence.
COILED    Extent of a coiled-coil region.

For example:



Any one of these key names can be specified as a region to be masked by the SLiMFinder program. For example, all features with the key name "TRANSMEM" can be removed from the dataset, using the masking feature of SLiMFinder, before the sequences are searched for motifs.
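
The idea of feature-based masking can be sketched in Python as follows. This is not SLiMFinder's own implementation. It assumes that feature lines have the form "FT   KEY   start   end   description" with 1-based inclusive positions, and that masked residues are replaced by "X"; both are assumptions made for illustration, as are the function names and the example data.

```python
# Feature keys listed in the table above.
MASKABLE = {"SIGNAL", "PROPEP", "TOPO_DOM", "TRANSMEM",
            "DOMAIN", "REPEAT", "REGION", "COILED"}


def feature_regions(ft_lines, keys):
    """Collect (start, end) positions of features whose key is in `keys`."""
    regions = []
    for line in ft_lines:
        fields = line.split()
        # Assumed layout: FT KEY START END [description...]
        if len(fields) < 4 or fields[0] != "FT":
            continue
        key, start, end = fields[1], fields[2], fields[3]
        if key in keys and start.isdigit() and end.isdigit():
            regions.append((int(start), int(end)))
    return regions


def mask_sequence(sequence, regions, mask_char="X"):
    """Replace residues inside each 1-based inclusive region with mask_char."""
    residues = list(sequence)
    for start, end in regions:
        for i in range(max(start, 1) - 1, min(end, len(residues))):
            residues[i] = mask_char
    return "".join(residues)


# Hypothetical entry fragment: one transmembrane region and one domain.
ft_lines = [
    "FT   TRANSMEM      4      8       Potential.",
    "FT   DOMAIN       10     12       Hypothetical domain.",
]
sequence = "MKTLLVAGAVLLPQRS"

# Mask only the transmembrane region.
regions = feature_regions(ft_lines, {"TRANSMEM"})
masked = mask_sequence(sequence, regions)
print(masked)
```

Passing a different subset of MASKABLE as `keys` masks other feature types in the same way; the sequence length is unchanged, so motif positions still refer to the original coordinates.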