SLiMFinder Help Pages

Help Overview

Main Help
  • QuickStart. Just the basics to get going with the server.
  • Input. Details of the necessary input requirements of the program.
  • Masking. Step by step information on how to mask your input dataset.
  • SLiMBuild. Additional SLiMBuild motif space options.
  • SLiMChance. Additional SLiMChance significance/filtering options.
  • Output. Quick overview of the output of the SLiMFinder server. (See example for details.)
  • References. A summary of the papers underpinning the server.
  • FAQ. Some Frequently Asked Questions and their answers.
Walkthrough
Walkthrough of example SLiMFinder analysis with screenshots.

Manual - (PDF)
Manual for the standalone version of SLiMFinder in PDF format. It may contain details that are not obvious from the server implementation of the program.

User-group
SLiMFinder user-group for additional community support.

Example Input
Example UniProt input file for proteins containing the Dynein light chain interaction motif.

Example Output
Fully functional results page corresponding to the example input run with default options.

CompariMotif
Help for the accessory application, CompariMotif, for comparing SLiMFinder results with known motifs.

References and Citations
Papers to cite when using results in publications.

Standard analysis

  1. Enter UniProt accession numbers and click Get sequences to retrieve the entries.

  2. Check entries have been retrieved correctly.

  3. Select/check masking options.

  4. To return motifs with variable-length wildcard spacers, switch on the wildvar SLiMBuild option.

  5. To return only significant motifs, change the probcut SLiMChance option to 0.05.

  6. Click Submit job to submit your job to the Bioware queue.

  7. Monitor progress or bookmark the page and revisit later.

  8. Explore the results. Write down the job ID to retrieve results again later.

  9. Publish work. Cite SLiMFinder. Live long and prosper.

Sequence information input files

NB. SLiMFinder is optimised for, and requires, at least three sequences for analysis. (Analysis for two sequences will be added at a future date.)

Examples of acceptable input formats are available here:
UniProt format
FASTA format


Two methods of entry are possible:

Raw protein sequences
SLiMFinder currently takes protein sequences in two formats: FASTA and UniProt long format. FASTA descriptions should, if possible, follow the guidelines used by the UniProt database. Other formats are accepted, but certain visualisations may not be produced and other anomalies may occur.

or

UniProt IDs
The right-hand panel of the inputs section allows a list of UniProt IDs to be entered. These entries will be fetched and used as the input data for SLiMFinder. For full functionality (including RLC masking), users are advised to enter protein data in this way.

FASTA format details

SLiMFinder is quite flexible about the precise details of the FASTA format file used for sequence input. However, to get maximum utility and allow differentiation of sequences and source databases, it is recommended to use downloads from UniProt.

The basic requirement for FASTA sequences is that each description should be on a single line that starts with ">" and is followed by one or more lines containing the actual sequence. The first word in each description should be unique.

For example.


Most databases and homology search results etc. can be downloaded in FASTA format.
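
The basic FASTA requirements described above can be checked before uploading. The following is a minimal Python sketch, not part of SLiMFinder itself: the function names read_fasta and check_fasta are hypothetical, and the three sequences are made up purely for illustration. It checks only what this page states: a ">" description line followed by sequence lines, a unique first word per description, and at least three sequences.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (description, sequence) pairs."""
    records = []
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:].strip(), []
        elif description is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


def check_fasta(records):
    """Return a list of problems; an empty list means the basic rules are met."""
    problems = []
    seen = set()
    for description, sequence in records:
        words = description.split()
        first_word = words[0] if words else ""
        if not first_word:
            problems.append("empty description line")
        elif first_word in seen:
            problems.append(f"duplicate first word: {first_word}")
        seen.add(first_word)
        if not sequence:
            problems.append(f"no sequence for: {first_word}")
    if len(records) < 3:
        problems.append("fewer than three sequences")
    return problems


# Made-up sequences, for illustration only.
example = """>seq1 hypothetical protein one
MKTAYIAKQR
QISFVKSHFS
>seq2 hypothetical protein two
MSDNELKRTQTP
>seq3 hypothetical protein three
MAAKSTQTGG
"""
records = read_fasta(example)
print(check_fasta(records))
```

An empty list means the input satisfies these basic checks; it does not guarantee that every server visualisation will be produced.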

UniProt Format

The UniProt format is a highly complex format described in detail at:

http://ca.expasy.org/sprot/userman.html

The format contains one or more protein entries, each terminated by a "//" end-of-entry delimiter. Each entry consists of several lines, each structured in a defined way and always beginning with a two-character line code.
Currently, SLiMFinder only uses the "Identification", "Sequence data" and "Feature Table data" fields from this format. "Identification" contains the protein name, which is used to identify the protein throughout the process. "Sequence data" contains the amino acid sequence of the protein, which is used for various parts of the analysis. "Feature Table data" is integral to masking and is described in more detail here and in the masking section.

Feature Table field

The Feature Table field gives the position of, and information about, areas of interest in the protein. There are a large number of defined keywords to describe these features; those used for masking by SLiMFinder are given in the table below.

Code       Description
SIGNAL     Extent of a signal sequence (prepeptide).
PROPEP     Extent of a propeptide.
TOPO_DOM   Topological domain.
TRANSMEM   Extent of a transmembrane region.
DOMAIN     Extent of a domain, which is defined as a specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.
REPEAT     Extent of an internal sequence repetition.
REGION     Extent of a region of interest in the sequence.
COILED     Extent of a coiled-coil region.

For example:



Any one of these key names can be specified as a region to be masked by the SLiMFinder program. For example, all features with the key name "TRANSMEM" can be removed from the dataset before the sequences are searched for motifs, using the masking feature of SLiMFinder.
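
To illustrate the entry structure described above ("//" delimiters and two-character line codes), here is a minimal Python sketch that splits a flat file into entries and lists the maskable feature keys found in each one. It is not SLiMFinder's own parser. The function names and the example entries are hypothetical, and the sketch assumes that a feature key begins in column 6 of an "FT" line, with continuation lines left blank in that column.

```python
# Feature keys listed in the table above.
MASK_KEYS = {"SIGNAL", "PROPEP", "TOPO_DOM", "TRANSMEM",
             "DOMAIN", "REPEAT", "REGION", "COILED"}


def split_entries(text):
    """Split UniProt flat-file text into entries, each a list of lines."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                entries.append(current)
            current = []
        elif line.strip():
            current.append(line)
    if current:
        entries.append(current)
    return entries


def summarise_entry(lines):
    """Return (identifier, maskable feature keys) for one entry."""
    identifier, keys = None, []
    for line in lines:
        code, rest = line[:2], line[2:]
        if code == "ID" and rest.split():
            identifier = rest.split()[0]
        elif code == "FT" and len(line) > 5 and line[5] != " ":
            # Assumption: a non-blank column 6 marks the start of a feature.
            key = line[5:].split()[0]
            if key in MASK_KEYS:
                keys.append(key)
    return identifier, keys


# Made-up entries, for illustration only.
example = """ID   EXAMPLE1_HUMAN          Reviewed;          30 AA.
FT   SIGNAL          1..5
FT   CHAIN           6..30
FT   TRANSMEM        10..20
FT                   /note="Helical"
SQ   SEQUENCE   30 AA;
     MKTAYIAKQR QISFVKSHFS RQLEERLGLI
//
ID   EXAMPLE2_HUMAN          Reviewed;          10 AA.
SQ   SEQUENCE   10 AA;
     MSDNELKRTQ
//
"""
for entry in split_entries(example):
    print(summarise_entry(entry))
```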

Masking Options

By default, sequence masking is on. Unchecking the masking option will switch off all masking.

Disorder masking

SLiMs tend to occur in disordered regions of proteins. The SLiMFinder server uses IUPRED (Dosztanyi et al. 2005) to predict regions of disorder with a relaxed score cut-off of 0.2. Residues predicted to be "intrinsically ordered" are masked out. This can be toggled on/off using the dismask option.

Because disorder masking utilises a per-residue score, there are often single residues that are just above/below the threshold in a region that is otherwise (dis)ordered. Regions can therefore be smoothed out using the minregion option, which stipulates the minimum number of consecutive residues that must have the same disorder state. (Dis)ordered regions smaller than this are assimilated into the neighbouring regions, starting with the smallest (1aa regions) and working up until all regions are large enough; within each region size, the sequence is traversed from N-terminal to C-terminal.
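
The minregion smoothing described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the server's actual code: the per-residue state is written as "D" (disordered) or "O" (ordered), a too-short region is assimilated by flipping it to the state of its neighbours, and the function names are hypothetical.

```python
def smooth_states(states, minregion):
    """Assimilate regions shorter than minregion into their neighbours.

    states is a string of 'D' (disordered) and 'O' (ordered) characters.
    Regions are handled smallest first (1 aa upwards); within each size
    the sequence is scanned from the N-terminus to the C-terminus.
    """
    states = list(states)
    n = len(states)
    for size in range(1, minregion):
        changed = True
        while changed:
            changed = False
            start = 0
            while start < n:
                end = start
                while end < n and states[end] == states[start]:
                    end += 1
                run = end - start
                # Never flip a region that spans the whole sequence.
                if run <= size and run < n:
                    new_state = "O" if states[start] == "D" else "D"
                    states[start:end] = new_state * run
                    changed = True
                    break  # rescan from the N-terminus
                start = end
    return "".join(states)


def mask_ordered(sequence, states):
    """Replace residues in ordered ('O') positions with X."""
    return "".join("X" if s == "O" else aa for aa, s in zip(sequence, states))


print(smooth_states("DDDDODDDDOOOOOO", 3))
print(mask_ordered("ACDEFG", "DDOODD"))
```

In the first call, the single ordered residue inside the disordered stretch is absorbed, so that stretch is no longer interrupted by a one-residue mask.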

Conservation masking

By default, conservation masking is used: metazoan orthologues are retrieved and underconserved residues are masked. For more details see:

Davey NE, Shields DC & Edwards RJ (2009):
Masking residues using context-specific evolutionary conservation significantly improves short linear motif discovery. Bioinformatics 25(4): 443-50.
[Bioinformatics.] [PubMed]



Feature masking

If the entry format is UniProt, defined features can be masked. Areas such as transmembrane regions, protein domains and inaccessible residues can be masked because they have a low likelihood of containing motifs.

The same mechanism also allows the user to identify specific regions of the proteins to which the search should be confined. For example, the user may wish to look only at motifs occurring in the cytoplasmic regions of a set of proteins, or may have prior knowledge of a region that possibly contains a functional motif.

There are two types of feature masking in SLiMFinder: Inclusive masking and Exclusive masking:
  • Inclusive masking will remove all of the sequence except the segment specified.
  • Exclusive masking will remove the segment specified.

Inclusive masking is performed first. This means that exclusively masked regions appearing inside an inclusively masked region will still be removed.
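
The order of operations can be sketched in Python as below. This is a hypothetical illustration, not the server's implementation: features are assumed to be (key, start, end) tuples with 1-based inclusive positions, masked residues are assumed to be replaced by "X", and the sequence and feature positions are made up.

```python
def apply_feature_masks(sequence, features, inclusive=(), exclusive=()):
    """Apply inclusive masking first, then exclusive masking.

    features: iterable of (key, start, end), 1-based inclusive positions.
    inclusive: feature keys to keep (everything else is masked).
    exclusive: feature keys to mask out.
    """
    if inclusive:
        keep = [False] * len(sequence)
        for key, start, end in features:
            if key in inclusive:
                for i in range(start - 1, min(end, len(sequence))):
                    keep[i] = True
    else:
        keep = [True] * len(sequence)
    for key, start, end in features:
        if key in exclusive:
            for i in range(start - 1, min(end, len(sequence))):
                keep[i] = False
    return "".join(aa if k else "X" for aa, k in zip(sequence, keep))


# Made-up sequence and features, for illustration only.
seq = "ACDEFGHIKLMNPQRSTVWY"
feats = [("TOPO_DOM", 3, 12), ("TRANSMEM", 8, 10)]
print(apply_feature_masks(seq, feats, inclusive={"TOPO_DOM"},
                          exclusive={"TRANSMEM"}))
```

Only residues 3 to 12 survive the inclusive step, and residues 8 to 10 are then removed by the exclusive step even though they lie inside the inclusive region.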

Masking is based on UniProt features. Examples are:

Code       Description
SIGNAL     Extent of a signal sequence (prepeptide).
PROPEP     Extent of a propeptide.
TOPO_DOM   Topological domain.
TRANSMEM   Extent of a transmembrane region.
DOMAIN     Extent of a domain, which is defined as a specific combination of secondary structures organized into a characteristic three-dimensional structure or fold.
REPEAT     Extent of an internal sequence repetition.
REGION     Extent of a region of interest in the sequence.
COILED     Extent of a coiled-coil region.

For example, [DOMAIN,TRANSMEM] in the feature mask field will mask out domain and transmembrane regions annotated by UniProt.

Custom feature masking by case

If you cannot mask features directly from UniProt entries, regions of uploaded sequences can be masked by using upper or lower case. Simply make sure all regions to be masked are in one case and all regions to be searched in the other and enter "Upper" or "Lower" in the casemask box. This will mask out the specified case. Note that this option requires uploading sequences and cannot therefore be used in conjunction with conservation masking. To use both, please download the SLiMFinder application or contact us.
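
As an illustration of how case can encode the regions to mask, here is a small Python sketch. It only mimics the effect described above; the function name case_mask, the use of "X" for masked residues and the example sequence are assumptions for illustration.

```python
def case_mask(sequence, casemask):
    """Mask residues written in the given case ('Upper' or 'Lower').

    Masked residues become X; the remaining residues are returned in
    upper case so that they can be searched as normal.
    """
    choice = casemask.strip().lower()
    if choice not in ("upper", "lower"):
        raise ValueError("casemask must be 'Upper' or 'Lower'")
    masked = []
    for aa in sequence:
        hit = aa.isupper() if choice == "upper" else aa.islower()
        masked.append("X" if hit else aa.upper())
    return "".join(masked)


print(case_mask("ACDefgHIK", "Lower"))
print(case_mask("ACDefgHIK", "Upper"))
```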

Additional masking options

For more details of these options, please refer to the SLiMFinder Manual.
  • Motif masking. If you wish to mask out common recurring motifs (e.g. [KR][KR]) this can be done using the motifmask box. Simply enter a list of motifs as regular expressions.
  • Complexity masking. SLiMFinder uses a simple complexity filter. If any amino acid occurs N+ times in a stretch of L amino acids (compmask=N,L) then the central (N-2) occurrences of that amino acid are replaced with Xs. E.g. PFPPIPLP would become PFXXIXLP. By default 5+ amino acids in an 8aa stretch invoke masking.
  • N-terminal methionines. By default these are masked to avoid artefactual N-terminal motifs. If the termini SLiMBuild option is off or your proteins are otherwise N-terminally truncated, you can switch this masking off using metmask.
  • Position-specific masking. Just as N-terminal methionines are over-represented, eukaryotes have an abundance of alanines at position 2. For this reason, they are also masked out by default (posmask).
  • Amino-acid masking. In extreme cases, you can mask out all occurrences of a given list of amino acids (aamask). This reduces the alphabet and thus reduces the motif search space.
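
The complexity filter in the list above can be sketched in Python. This is one possible reading of the rule, not the program's own code: whenever N consecutive occurrences of the same amino acid fit within L residues, the occurrences between the first and last of that group are masked. How the real filter treats more than N occurrences in a stretch is not specified here, so that behaviour is an assumption.

```python
from collections import defaultdict


def complexity_mask(sequence, n=5, length=8):
    """Mask the central occurrences of any residue found n+ times in
    a stretch of `length` residues (defaults follow compmask=5,8)."""
    positions = defaultdict(list)
    for i, aa in enumerate(sequence):
        if aa != "X":  # already-masked residues are ignored
            positions[aa].append(i)
    to_mask = set()
    for pos in positions.values():
        for i in range(len(pos) - n + 1):
            group = pos[i:i + n]
            if group[-1] - group[0] + 1 <= length:
                # Keep the outer two occurrences, mask the central n-2.
                to_mask.update(group[1:-1])
    return "".join("X" if i in to_mask else aa
                   for i, aa in enumerate(sequence))


print(complexity_mask("PFPPIPLP"))  # PFXXIXLP, as in the example above
```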

SLiMBuild Options

A number of different options can be set to control how SLiMFinder restricts the motif search space during SLiMBuild motif construction. More details can be found in the SLiMFinder Manual.

Motif Search Options

  • minwild. Minimum number of consecutive wildcard positions to allow.
  • maxwild. Maximum number of consecutive wildcard positions to allow.
  • slimlen. Maximum length of SLiMs to return (no. non-wildcard positions).
  • wildvar. Whether to allow variable length wildcards.

Motif Occurrence Options

  • minocc. Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if < 1).
  • absmin. Used if minocc < 1 to define absolute min. UP occ.
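
The interaction between minocc and absmin can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, n_up stands for the number of unrelated proteins (UP) in the dataset, and rounding the proportion up to the next whole number is a guess rather than documented behaviour.

```python
import math


def required_occurrences(minocc, absmin, n_up):
    """Return the minimum number of unrelated occurrences required.

    If minocc is 1 or more it is used directly. If it is below 1 it is
    treated as a proportion of n_up, with absmin as an absolute floor.
    """
    if minocc >= 1:
        return int(minocc)
    # Assumption: the proportion is rounded up to a whole number.
    return max(absmin, math.ceil(minocc * n_up))


print(required_occurrences(4, 3, 10))     # absolute value used as given
print(required_occurrences(0.5, 3, 10))   # half of 10 proteins
print(required_occurrences(0.25, 3, 8))   # proportion falls below absmin
```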

Ambiguity Options

  • equiv. List of TEIRESIAS-style amino acid ambiguities to use.
  • preamb. Whether to search for ambiguous motifs during motif discovery.
  • combamb. Whether to search for combined amino acid degeneracy and variable wildcards.

UPC/BLAST Options

  • blaste. BLAST e-value threshold for determining relationships.

Special Options

  • alphahelix. Special i, i+3/4, i+7 motif discovery.
  • termini. Whether to add termini characters (^ & $) to search sequences.
  • dna. Whether to search input sequences as DNA. Make sure ambiguities (equiv) etc. are changed accordingly.

SLiMChance/Filtering Options

Further options control the SLiMChance significance algorithm, which assesses the significance of motifs returned by SLiMBuild. Additional filtering options can also customise which motifs will be returned. More details can be found in the SLiMFinder Manual.

SLiMChance Options

  • probcut. Sig (approx. p-value) cut-off for returned motifs. (Remember, significant results are <= 0.05!)
  • maskfreq. Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations.
  • sigv. Use SigV statistics (more accurate but slower).

Motif Filtering Options

  • topranks. Will only output top X motifs meeting probcut.
  • minic. Minimum information content for returned motifs.
  • musthave. Returned motifs must contain one or more of the AAs in LIST (reduces search space).
  • query. Return only SLiMs that occur in 1+ Query sequences (Name/AccNum).
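
The effect of probcut, topranks and musthave on the returned list can be illustrated with a short Python sketch that filters results after the fact. The record layout (a "pattern" and a "sig" value), the function name and the Sig values are all made up for illustration. In SLiMFinder itself musthave also reduces the search space, which a post-hoc filter like this does not reproduce.

```python
def filter_motifs(motifs, probcut=0.99, topranks=None, musthave=""):
    """Keep motifs with sig <= probcut, best (lowest sig) first.

    motifs: list of dicts with hypothetical keys 'pattern' and 'sig'.
    musthave: string of amino acids; a motif must contain at least one.
    topranks: if given, keep only this many of the best motifs.
    """
    kept = [m for m in motifs if m["sig"] <= probcut]
    if musthave:
        kept = [m for m in kept
                if any(aa in m["pattern"] for aa in musthave)]
    kept.sort(key=lambda m: m["sig"])
    if topranks is not None:
        kept = kept[:topranks]
    return kept


# Made-up results, for illustration only.
motifs = [
    {"pattern": "K.TQT", "sig": 0.001},
    {"pattern": "P..P", "sig": 0.3},
    {"pattern": "[KR]..L", "sig": 0.04},
    {"pattern": "S.E", "sig": 0.999},
]
for m in filter_motifs(motifs, probcut=0.05):
    print(m["pattern"], m["sig"])
```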

Output Description

An example results output for a Dynein Light Chain binding protein dataset is available here and in the screenshot walkthrough.

Example Visualisations



The Concise Motif Map tries to condense all the attributes of a protein into one graphic. All protein features masked in the dataset are overlaid on the protein: red for proteins, green for fold, khaki for transmembrane and pink for tissue or location. All other masked features are grey and labelled. Graphs for net charge, hydropathy and absolute charge are plotted above the protein, and graphs for disorder (using IUPRED) and surface accessibility (using the Emini method) are plotted below the protein. The top twenty motifs are placed along the protein as coloured balls with their rank shown in the centre of the ball. These balls are placed so that, if two or more motifs overlap, the highest-ranking motif is on top. Non-overlapping lollipops descend from each motif ball to give the actual motif sequence.




The Motif alignment takes a motif and stacks the proteins containing that motif on top of each other, showing the 30 residues on either side of each occurrence. The alignments are ungapped and are not multiple sequence alignments. This view helps the user to understand the context of the motif and allows visualisation of any obvious homology between sequences around the motif, which gives information about whether a motif is convergently or divergently evolved.
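
A comparable ungapped stack can be produced from your own sequences with a few lines of Python. This sketch is not the server's rendering code: the function name, the "-" padding at sequence ends and the example sequences are assumptions, and overlapping occurrences of the same motif are not reported because the regular expression search is non-overlapping.

```python
import re


def motif_alignment(sequences, motif, flank=30):
    """Stack each motif occurrence with `flank` residues on either side.

    sequences: dict of name -> sequence; motif: regular expression.
    Rows are padded with '-' where a flank runs off the sequence end.
    """
    pattern = re.compile(motif)
    rows = []
    for name, seq in sequences.items():
        for match in pattern.finditer(seq):
            left = seq[max(0, match.start() - flank):match.start()]
            right = seq[match.end():match.end() + flank]
            row = left.rjust(flank, "-") + match.group() + right.ljust(flank, "-")
            rows.append((name, row))
    return rows


# Made-up sequences; a short flank keeps the output readable.
seqs = {"a": "AAAKSTQTGGG", "b": "KATQTLLLL"}
for name, row in motif_alignment(seqs, "K.TQT", flank=3):
    print(name, row)
```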

References and Citations

When using SLiMFinder results in a publication, please cite the main PLoS One paper:

  • Edwards RJ, Davey NE & Shields DC (2007): SLiMFinder: A probabilistic method for identifying over-represented, convergently evolved, short linear motifs in proteins. PLoS ONE 2(10): e967. [PLoS One.] [PubMed]

In addition, SLiMFinder uses the following underlying software:

  • Altschul SF, Gish W, Miller W, Myers EW & Lipman DJ (1990): Basic local alignment search tool.
    J. Mol. Biol. 215:403-410.

If using IUPRED disorder prediction to mask input sequences and/or filter results, please cite:

  • Dosztanyi Z, Csizmok V, Tompa P & Simon I (2005): IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21:3433-3434.

If using Relative Local Conservation masking, in addition to the GOPHER citations (below) please cite:

  • Davey NE, Shields DC & Edwards RJ (2009): Masking residues using context-specific evolutionary conservation significantly improves short linear motif discovery. Bioinformatics 25(4): 443-50. [Bioinformatics.] [PubMed]

If using alignments of homologous proteins, generated by GOPHER, please cite:

  • Davey NE*, Edwards RJ*, Shields DC (2007): The SLiMDisc server: short, linear motif discovery in proteins. Nucleic Acids Res. 35(Web Server issue):W455-9. [Nucleic Acids Res.] [PubMed]
    *Joint first authors
  • Edgar RC (2004): MUSCLE: a multiple sequence alignment method with reduced time and space complexity.
    BMC Bioinformatics 5:113.

If using the SigV advanced SLiMChance statistics, please cite:

  • Davey NE, Edwards RJ & Shields DC (2010): Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins. BMC Bioinformatics 11: 14. [BMC Bioinformatics.]



Contributing Labs

This server is hosted by the Clinical Bioinformatics group led by Prof. Denis Shields in the Conway Institute of Biomolecular and Biomedical Research. The tools have been developed by Rich Edwards (currently at The University of Southampton) and Norman Davey (currently at EMBL Heidelberg).

This project is a collaboration between three institutions: the Conway Institute of Biomolecular and Biomedical Research at University College Dublin (Dublin, Ireland), the School of Biological Science at the University of Southampton (Southampton, England) and the European Molecular Biology Laboratory (Heidelberg, Germany).

Frequently Asked Questions

Q. Why can I only use conservation masking for UniProt entries downloaded through your site?
A. To save time, the server re-uses GOPHER alignments that have been made before. These are recognised by the accession number of the proteins. It is therefore vital to ensure that the same accession number always corresponds to the same protein sequence. Conservation masking of custom sequences can be performed using the downloadable version.

Q. If significance is Sig <= 0.05 why is probcut set to 0.99 by default?
A. We prefer to return more motifs by default and let the user determine what they think is interesting, as different situations have different criteria. We urge extreme caution when interpreting motifs with Sig > 0.5 as they are most probably over-represented by chance.

Q. Why are the results slightly different for the server when running benchmark data from the original papers?
A. The server features a couple of changes to the default settings. In particular, an updated sequence database is used for orthologue predictions for conservation masking.

Q. Why do my results not have any motifs with variable-length wildcards?
A. By default, flexible-length wildcards are turned off to improve run-times. This option can be switched back on using wildvar in the Build options. If you want motifs with both flexible-length wildcards and degenerate amino acid positions, also switch on combamb. (This will slow jobs down a bit.)