Distribution compiled: Thu Feb 11 20:01:56 2010
Questions/Comments?: please contact software@cabbagesofdoom.co.uk
The software should run on any system that has Python installed. Additional software may be necessary for full functionality. Further details can be found in the manuals supplied.
Copyright (C) 2009 Richard J. Edwards <software@cabbagesofdoom.co.uk>
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
Author contact: <software@cabbagesofdoom.co.uk> / School of Biological Sciences, University of Southampton, UK.
To incorporate this module into your own programs, please see GNU Lesser General Public License disclaimer in rje.py
Module: slimfinder
Description: Short Linear Motif Finder
Version: 4.0
Last Edit: 11/02/10
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological
systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few
as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular,
it can be very difficult to distinguish a randomly recurring "motif" from a truly over-represented one. Incorporating
ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates
the matter.
SLiMFinder is an integrated SLiM discovery program building on the principles of the SLiMDisc software for accounting
for evolutionary relationships [Davey NE, Shields DC & Edwards RJ (2006): Nucleic Acids Res. 34(12):3546-54].
SLiMFinder comprises two algorithms:
SLiMBuild identifies convergently evolved, short motifs in a dataset. Motifs with fixed amino acid positions are
identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. Unlike
programs such as TEIRESIAS, which return all shared patterns, SLiMBuild accelerates the process and reduces returned
motifs by explicitly screening out motifs that do not occur in enough unrelated proteins. For this, SLiMBuild uses
the "Unrelated Proteins" (UP) algorithm of SLiMDisc in which BLAST is used to identify pairwise relationships.
Proteins are then clustered according to these relationships into "Unrelated Protein Clusters" (UPCs), which are
defined such that no protein in a UPC has a BLAST-detectable relationship with a protein in another UPC. If desired,
SLiMBuild can be used as a replacement for TEIRESIAS in other software (teiresias=T slimchance=F).
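The UPC definition above amounts to taking the connected components of the graph of BLAST-detectable relationships. A minimal sketch, assuming the pairwise relationships have already been extracted as pairs of protein names (the function name and input format are illustrative and not part of SLiMFinder):

```python
def cluster_upcs(proteins, related_pairs):
    """Group proteins so that no related pair is split across two clusters."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Each detectable relationship merges the two clusters involved.
    for a, b in related_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(c) for c in clusters.values())


upcs = cluster_upcs(["P1", "P2", "P3", "P4"], [("P1", "P2"), ("P2", "P3")])
print(upcs)  # P1-P3 are linked through P2; P4 forms its own UPC
```

Note that P1 and P3 end up in the same UPC even though they have no direct relationship, because both are related to P2.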
SLiMChance estimates the probability of these motifs arising by chance, correcting for the size and composition of
the dataset, and assigns a significance value to each motif. Motif occurrence probabilities are calculated
independently for each UPC, adjusted for the size of a UPC using the Minimum Spanning Tree algorithm from SLiMDisc. These
individual occurrence probabilities are then converted into the total probability of seeing the observed motifs
the observed number of (unrelated) times. These probabilities assume that the motif is known before the search. In
reality, only over-represented motifs from the dataset are looked at, so these probabilities are adjusted for the
size of motif-space searched to give a significance value. This is an estimate of the probability of seeing that
motif, or another one like it. These values are calculated separately for each length of motif. Where pre-known
motifs are also of interest, these can be given with the slimcheck=MOTIFS option and will be added to the output.
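The two-stage calculation described above can be sketched as follows. This is a simplified illustration only: it assumes that the per-UPC probabilities are summarised by their mean and fed into a binomial tail, and that the motif-space adjustment takes the form 1 - (1 - p)^B for B possible motifs. The function names and the exact form of the correction are assumptions, not the SLiMChance implementation.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def motif_significance(upc_probs, support, motif_space):
    """Estimate significance of a motif seen in `support` of the UPCs.

    upc_probs   : per-UPC probability of the motif occurring by chance
    support     : number of UPCs in which the motif was observed
    motif_space : assumed number of motifs that could have been observed
    """
    # Assumption: treat each UPC as a trial with the mean occurrence probability.
    mean_p = sum(upc_probs) / len(upc_probs)
    p_motif = binomial_tail(support, len(upc_probs), mean_p)
    # Adjust for not knowing the motif before the search.
    return 1 - (1 - p_motif) ** motif_space
```

With a fixed per-motif probability, the adjusted value grows as the motif space grows, which is why the correction is applied separately for each motif length.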
SLiMFinder version 4.0 introduced a more precise (but more computationally intensive) statistical model, which can
be switched on using sigprime=T. Likewise, the more precise (but more computationally intensive) correction to the
mean UPC probability heuristic can be switched on using sigv=T. (Note that the other SLiMChance options may not
work with either of these options.) The allsig=T option will output all four scores. In this case, SigPrimeV will be
used for ranking etc. unless probscore=X is used.
Where significant motifs are returned, SLiMFinder will group them into Motif "Clouds", which consist of physically
overlapping motifs (2+ non-wildcard positions are the same in the same sequence). This provides an easy indication
of which motifs may actually be variants of a larger SLiM and should therefore be considered together.
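As a rough sketch of the cloud grouping rule, assume each motif is represented by the set of (sequence, residue position) pairs covered by its non-wildcard positions. Two motifs then join the same cloud if they share two or more such positions within one sequence. The data layout and function names here are hypothetical.

```python
def shares(pos_a, pos_b, min_shared):
    """True if two position sets share min_shared+ positions in a single sequence."""
    counts = {}
    for seq, _ in pos_a & pos_b:
        counts[seq] = counts.get(seq, 0) + 1
    return any(n >= min_shared for n in counts.values())


def build_clouds(motif_positions, min_shared=2):
    """motif_positions: {motif: set of (sequence, position)} for defined positions."""
    clouds = []
    for motif, positions in motif_positions.items():
        # Find every existing cloud containing a motif that this one overlaps.
        overlapping = [c for c in clouds
                       if any(shares(positions, motif_positions[m], min_shared) for m in c)]
        merged = {motif}
        for c in overlapping:
            merged |= c
            clouds.remove(c)
        clouds.append(merged)
    return clouds


clouds = build_clouds({
    "RxLxF": {("s1", 3), ("s1", 5), ("s1", 7)},
    "LxF": {("s1", 5), ("s1", 7)},
    "PxxP": {("s2", 10), ("s2", 13)},
})
```

Here RxLxF and LxF share two defined positions in sequence s1 and so fall into one cloud, while PxxP stays on its own.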
Additional Motif Occurrence Statistics, such as motif conservation, are handled by the rje_slimlist module. Please
see the documentation for this module for a full list of commandline options. These options are currently under
development for SLiMFinder and are not fully supported. See the SLiMFinder Manual for further details. Note that the
OccFilter *does* affect the motifs returned by SLiMBuild and thus the TEIRESIAS output (as does min. IC and min.
Support) but the overall Motif StatFilter *only* affects SLiMFinder output following SLiMChance calculations.
Secondary Functions:
The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.
The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets
by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final
datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.
Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
Module: SLiMSearch
Description: Short Linear Motif Search tool
Version: 1.4
Last Edit: 08/01/10
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
SLiMSearch is a tool for finding pre-defined SLiMs (Short Linear Motifs) in a protein sequence database. SLiMSearch
can make use of corrections for evolutionary relationships and a variation of the SLiMChance algorithm from
SLiMFinder to assess motifs for statistical over- and under-representation. SLiMSearch is a replacement for PRESTO
and uses many of the same underlying modules.
Benefits of SLiMSearch that make it more useful than a lot of existing tools include:
* searching with mismatches rather than restricting hits to perfect matches.
* optional equivalency files for searching with specific allowed mismatches (e.g. charge conservation)
* generation or reading of alignment files from which to calculate conservation statistics for motif occurrences.
* additional statistics, including protein disorder, surface accessibility and hydrophobicity predictions
* recognition of "n of m" motif elements in the form
times across which m positions. E.g.
Main output for SLiMSearch is a delimited file of motif/peptide occurrences but the motifaln=T and proteinaln=T options also
allow output of alignments of motifs and their occurrences.
Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
Single line per motif format = 'Name Sequence #Comments' (Comments are optional and ignored)
Alternative formats include fasta, SLiMDisc output and raw motif lists.
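A minimal sketch of reading the single-line motif format, assuming whitespace-separated fields and that '#' never appears inside a motif pattern (the function name is illustrative):

```python
def parse_motif_line(line):
    """Return (name, pattern), or None for blank or comment-only lines."""
    # Everything after the first '#' is an optional comment and is ignored.
    content = line.split("#", 1)[0].strip()
    if not content:
        return None
    fields = content.split()
    if len(fields) < 2:
        raise ValueError("expected 'Name Sequence': %r" % line)
    return fields[0], fields[1]


print(parse_motif_line("LIG_example RxLxF # optional comment"))
```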
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SearchDB Options I: Input Protein Sequence Masking ###
### SearchDB Options II: Evolutionary Filtering ###
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SLiMChance Options ###
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Output Options ###
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle files
- 2 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
* See also rje_slimcalc options for occurrence-based calculations and filtering *
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje
Other modules needed: None
Module: qslimfinder
Description: Query Short Linear Motif Finder
Version: 1.0
Last Edit: 29/05/09
Copyright (C) 2008 Richard J. Edwards - See source code for GNU License Notice
Function:
QSLiMFinder is a modification of the basic SLiMFinder tool to specifically look for SLiMs shared by a query sequence
and one or more additional sequences. To do this, SLiMBuild first identifies all motifs that are present in the query
sequence before removing it (and its UPC) from the dataset. The rest of the search and stats takes place using the
remainder of the dataset but only using motifs found in the query. The final correction for multiple testing is made
using a motif space defined by the original query sequence, rather than the full potential motif space used by the
original SLiMFinder. This is offset against the increased probability of the observed motif support values due to the
reduction of support that results from removing the query sequence but could potentially still identify SLiMs with
increased significance.
Note that minocc and ambocc values *include* the query sequence, e.g. minocc=2 specifies the query and ONE other UPC.
Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
Module: comparimotif
Description: Motif vs Motif Comparison Software
Version: 3.5
Last Edit: 04/08/09
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
CompariMotif is a piece of software with a single objective: to take two lists of protein motifs and compare them to
each other, identifying which motifs have some degree of overlap, and identifying the relationships between those
motifs. It can be used to compare a list of motifs with themselves, their reversed selves, or a list of previously
published motifs, for example (e.g. ELM (http://elm.eu.org/)). CompariMotif outputs a table of all pairs of matching
motifs, along with their degree of similarity (information content) and their relationship to each other.
The best match is used to define the relationship between the two motifs. These relationships consist of the
following keywords:
Match type keywords identify the type of relationship seen:
* Exact = all the matches in the two motifs are precise
* Variant = focal motif contains only exact matches and subvariants of degenerate positions in the other motif
* Degenerate = the focal motif contains only exact matches and degenerate versions of positions in the other motif
* Complex = some positions in the focal motif are degenerate versions of positions in the compared motif, while
others are subvariants of degenerate positions.
Match length keywords identify the length relationships of the two motifs:
* Match = both motifs are the same length and match across their entire length
* Parent = the focal motif is longer and entirely contains the compared motif
* Subsequence = the focal motif is shorter and entirely contained within the compared motif
* Overlap = neither motif is entirely contained within the other
This gives sixteen possible classifications for each motif's relationship to the compared motif.
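The sixteen classifications are simply every pairing of a match type keyword with a match length keyword. A short sketch (joining the two keywords with a space is an assumption about the output format):

```python
from itertools import product

MATCH_TYPES = ["Exact", "Variant", "Degenerate", "Complex"]
MATCH_LENGTHS = ["Match", "Parent", "Subsequence", "Overlap"]

# One classification per (type, length) combination.
classifications = ["%s %s" % pair for pair in product(MATCH_TYPES, MATCH_LENGTHS)]
print(len(classifications))  # 16
```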
Input:
CompariMotif can take input in a number of formats. The preferred format is SLiMSearch format, which is a single line
motif format: 'Name Sequence #Comments' (Comments are optional and ignored). Alternative inputs include SLiMDisc and
Slim Pickings output, raw lists of motifs, and fasta format.
Output:
The main output for CompariMotif is a delimited text file containing the following fields:
* File1 = Name of motifs file (if outstyle=multi)
* File2 = Name of searchdb file (if outstyle=multi)
* Name1 = Name of motif from motif file 1
* Name2 = Name of motif from motif file 2
* Motif1 = Motif (pattern) from motif file 1
* Motif2 = Motif (pattern) from motif file 2
* Sim1 = Description of motif1's relationship to motif2
* Sim2 = Description of motif2's relationship to motif1
* Match = Text summary of matched region
* MatchPos = Number of matched positions between motif1 and motif2 (>= minshare=X)
* MatchIC = Information content of matched positions
* NormIC = MatchIC as a proportion of the maximum possible MatchIC
* Score = Heuristic score (MatchPos x NormIC) for ranking motif matches
* Info1 = Ambiguity score of motif1
* Info2 = Ambiguity score of motif2
* Desc1 = Description of motif1 (if motdesc = 1 or 3)
* Desc2 = Description of motif2 (if motdesc = 2 or 3)
With the exception of the file names, which are only output if outstyle=multi, the above is the output for the
default "normal" output style. If outstyle=single then only statistics for motif2 (the searchdb motif) are output
as this is designed for searches using a single motif against a motif database. If outstyle=normalsplit or
outstyle=multisplit then motif1 information is grouped together, followed by motif2 information, followed by the
match statistics. More information can be found in the CompariMotif manual.
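The relationship between the MatchPos, MatchIC, NormIC and Score fields can be illustrated as below. How the maximum possible MatchIC is derived is not stated here, so it is passed in as a plain argument. The function name is hypothetical.

```python
def match_score(match_pos, match_ic, max_match_ic):
    """Return (NormIC, Score) as defined in the output field list."""
    norm_ic = match_ic / max_match_ic   # NormIC: MatchIC as a proportion of the maximum
    return norm_ic, match_pos * norm_ic  # Score: MatchPos x NormIC


norm_ic, score = match_score(match_pos=3, match_ic=2.0, max_match_ic=4.0)
```

Because the score multiplies the number of matched positions by a proportion between 0 and 1, it can never exceed MatchPos.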
Commandline:
## Basic Input Parameters ##
Module: unifake
Description: Fake UniProt DAT File Generator
Version: 1.2
Last Edit: 10/10/08
Copyright (C) 2008 Richard J. Edwards - See source code for GNU License Notice
Function:
This program runs a number of in silico prediction programs and converts protein sequences into a fake UniProt DAT
flat file. Additional features may be given as one or more tables, using the features=LIST option. Please see the
UniFake Manual for more details.
Commandline:
### ~ INPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
Module: gopher
Description: Generation of Orthologous Proteins from High-Throughput Estimation of Relationships
Version: 2.7
Last Edit: 05/02/10
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
This script is designed to take in two sequences files and generate datasets of orthologous sequence alignments.
The first [seqin] sequence set is the 'queries' around which orthologous datasets are to be assembled. This is now
optimised for a dataset consisting of one protein per protein-coding gene, although splice variants should be dealt
with OK and treated as paralogues. This will only cause problems if the postdup=T option is used, which restricts
orthologues returned to be within the last post-duplication clade for the sequence.
The second [orthdb] is the list of proteins from which the orthologues will be extracted. The seqin sequences are
then BLASTed against the orthdb and processed (see below) to retain putative orthologues using an estimation of the
phylogenetic relationships based on pairwise sequence similarities.
NB. As of version 2.0, gopher=FILE has been replaced with seqin=FILE for greater rje python consistency. The allqry
option has been removed. Please cleanup the input data into a desired non-redundant dataset before running GOPHER.
(In many ways, GOPHER's strength is its capacity to be run for a single sequence of interest rather than a whole
genome, and it is this functionality that has been concentrated on for use with PRESTO and SLiM Pickings etc.) The
output of statistics for each GOPHER run has also been discontinued for now but may be reintroduced with future
versions. The phosalign command (to produce a table of potential phosphorylation sites (e.g. S,T,Y) across
orthologues for special conservation of phosphorylation prediction analyses) has also been discontinued for now.
Version 2.1 has tightened up on the use of rje_seq parameters that were causing trouble otherwise. It is now the
responsibility of the user to make sure that the orthologue database meets the desired criteria. Duplicate accession
numbers will not be tolerated by GOPHER and (arbitrary) duplicates will be deleted if the sequences are the same, or
renamed otherwise. Renaming may cause problems later. It is highly desirable not to have two proteins with the same
accession number but different amino acid sequences. The following commands are added to the rje_seq object when input
is read: accnr=T unkspec=F specnr=F gnspacc=T. Note that unknown species are also not permitted.
The process for dataset assembly is as follows for each protein :
1. BLAST against orthdb [orthblast]
> BLASTs saved in BLAST/AccNum.blast
2. Work through BLAST hits, identifying paralogues (query species duplicates) and the closest homologue from each
other species. This involves a second BLAST of the query versus original BLAST hits (e-value=10, no complexity
filter). The best sequence from each species is kept, i.e. the one with the best similarity to the query and not part
of a clade with any paralogue that excludes the query. (If postdup=T, the hit must be in the query's post duplication
clade.) In addition hits: [orthfas]
* Must have minimum identity level with Query
* Must be one of the 'good species' [goodspec=LIST]
> Save reduced sequences as ORTH/AccNum.orth.fas
> Save paralogues identified (and meeting minsim settings) in PARA/AccNum.para.fas
3. Align sequences with MUSCLE [orthalign]
> ALN/AccNum.orthaln.fas
4. Generate an unrooted tree with (ClustalW or PHYLIP) [orthtree]
> TREE/AccNum.orth.nsf
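For scripts that post-process a GOPHER run, the output locations listed in steps 1 to 4 can be gathered per accession number. This sketch only restates the paths given above. The function name and dictionary keys are illustrative, and the paths are assumed to be relative to the run directory.

```python
def gopher_outputs(acc_num):
    """Map each GOPHER stage to the file it is documented to produce."""
    return {
        "orthblast": "BLAST/%s.blast" % acc_num,       # step 1: BLAST against orthdb
        "orthfas": "ORTH/%s.orth.fas" % acc_num,       # step 2: reduced orthologue set
        "parafas": "PARA/%s.para.fas" % acc_num,       # step 2: paralogues meeting minsim
        "orthalign": "ALN/%s.orthaln.fas" % acc_num,   # step 3: alignment
        "orthtree": "TREE/%s.orth.nsf" % acc_num,      # step 4: unrooted tree
    }


paths = gopher_outputs("P12345")
```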
Optional paralogue/subfamily output: (These are best not used with Force=T or FullForce=T)
2a. Alignment of query protein and any paralogues >minsim threshold (paralign=T/F). The parasplice=T/F controls
whether splice variants are in these paralogue alignments (where identified using AccNum-X notation).
> PARALN/AccNum.paraln.fas
2b. Pairwise combinations of paralogues and their orthologues aligned, with "common" orthologues removed from the
dataset, with a rooted tree and group data for BADASP analysis etc. (parafam=T)
> PARAFAM/AccNum+ParaAccNum.parafam.fas
> PARAFAM/AccNum+ParaAccNum.parafam.nsf
> PARAFAM/AccNum+ParaAccNum.parafam.grp
2c. Combined protein families consisting of a protein, all the paralogues > minsim and all orthologues for each in a
single dataset. Unaligned. (gopherfam=T)
> SUBFAM/AccNum.subfam.fas
*NB.* The subfamily outputs involve Gopher calling itself to ensure the paralogues have gone through the Gopher
process themselves. This could potentially cause conflict if forking is used.
Commandline:
### Basic Input/Output ###
Module: ned_eigenvalues
Description: Modified N. Davey Relative Local Conservation module
Version: 1.0
Last Edit: 03/09/09
Copyright (C) 2009 Norman E. Davey & Richard J. Edwards - See source code for GNU License Notice
Function:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module.
Uses general modules: operator, math, random
Module: ned_rankbydistribution
Description: Modified SLiMFinder stats module
Version: 1.0
Last Edit: 03/09/09
Copyright (C) 2009 Norman E. Davey & Richard J. Edwards - See source code for GNU License Notice
Function:
This module is a stripped down template for methods only. This is for when a class has too many methods and becomes
untidy. In this case, methods can be moved into a methods module and 'self' replaced with the relevant object.
Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module.
Uses general modules: re, copy, random, math, sys, time, os, pickle, sets, string, traceback
Uses RJE modules: rje_seq, rje_uniprot, rje, rje_blast, rje_slim
Module: rje
Description: Contains General Objects for all my (Rich's) scripts
Version: 3.10
Last Edit: 07/01/10
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
General module containing Classes used by all my scripts plus a number of miscellaneous methods.
- Output to Screen, Commandline parameters and Log Files
Commandline options are all in the form X=Y. Where Y is to include spaces, use X="Y".
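A minimal sketch of parsing options in this X=Y form, including quoted values containing spaces. It uses the standard shlex module rather than the rje module's own parser. Lower-casing option names is an assumption made for the example.

```python
import shlex


def parse_options(command_string):
    """Split a command string into an {option: value} dictionary."""
    options = {}
    # shlex honours the double quotes in X="Y", so spaces inside Y are kept.
    for token in shlex.split(command_string):
        if "=" in token:
            key, value = token.split("=", 1)
            options[key.lower()] = value  # assumption: option names ignore case
    return options


print(parse_options('seqin=proteins.fas basefile="my results" sigprime=T'))
```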
General Commandline:
Module: rje_aaprop
Description: AA Property Matrix Module
Version: 0.1
Last Edit: 18/05/06
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
Takes an amino acid property matrix file and reads into an AAPropMatrix object. Converts it into an all-by-all property
difference matrix. By default, gaps and Xs will be given null properties (None) unless part of input file.
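The conversion can be pictured as counting, for every pair of amino acids, the properties on which they differ. The sketch below assumes properties are stored as 0/1 values in nested dictionaries and that null (None) properties are skipped when counting. Both are assumptions about the data layout, not a description of the AAPropMatrix object.

```python
def property_difference(aaprop):
    """aaprop: {amino_acid: {property: 0, 1 or None}} -> {(a, b): differing properties}."""
    diff = {}
    for a, props_a in aaprop.items():
        for b, props_b in aaprop.items():
            diff[(a, b)] = sum(
                1 for prop in props_a
                # Null properties (e.g. for gaps and Xs) are not counted as differences.
                if props_a[prop] is not None and props_b.get(prop) is not None
                and props_a[prop] != props_b[prop]
            )
    return diff


toy = {
    "K": {"charged": 1, "small": 0},
    "G": {"charged": 0, "small": 1},
    "X": {"charged": None, "small": None},
}
diff = property_difference(toy)
```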
Commandline:
Module: rje_ancseq
Description: Ancestral Sequence Prediction Module
Version: 1.2
Last Edit: 08/01/07
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
This module contains the objects and methods for ancestral sequence prediction. Currently, only GASP (Edwards & Shields
2004) is implemented. Other methods may be incorporated in the future.
GASP Commandline:
Module: rje_blast
Description: BLAST Control Module
Version: 1.10
Last Edit: 09/06/09
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
Performs BLAST searches and loads results into objects. Performs GABLAM conversion of local alignments into global
alignment statistics.
Objects:
BLASTRun = Full BLAST run
BLASTSearch = Information for a single Query search within a BLASTRun
BLASTHit = Detailed Information for a single Query-Hit pair within BLASTRun
PWAln = Detailed Information for each aligned section of a Query-Hit Pair
Commandline:
Module: rje_dismatrix
Description: Distance Matrix Module
Version: 2.3
Last Edit: 09/04/08
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
DisMatrix Class. Stores distance matrix data and contains methods for extra calculations, such as MST. If pylab is
installed, a distance matrix can also be turned into a heatmap.
Commandline:
Module: rje_disorder
Description: Disorder Prediction Module
Version: 0.5
Last Edit: 16/05/08
Copyright (C) 2006 Richard J. Edwards - See source code for GNU License Notice
Function:
This module currently has limited function and no standalone capability, though this may be added with time. It is
designed for use with other modules. The disorder Class can be given a sequence and will run the appropriate
disorder prediction software and store disorder prediction results for use in other programs. The sequence will have
any gaps removed.
Currently two disorder prediction methods are implemented:
* IUPred : Dosztanyi Z, Csizmok V, Tompa P & Simon I (2005). J. Mol. Biol. 347, 827-839. This has to be installed
locally. It is available on request from the IUPred website and any use of results should cite the method. (See
http://iupred.enzim.hu/index.html for more details.) IUPred returns a value for each residue, which by default,
is determined to be disordered if > 0.5.
* FoldIndex : This is run directly from the website (http://bioportal.weizmann.ac.il/fldbin/findex) and more simply
returns a list of disordered regions. You must have a live web connection to use this method!
For IUPred, the individual residue results are stored in Disorder.list['ResidueDisorder']. For both methods, the
disordered regions are stored in Disorder.list['RegionDisorder'] as (start,stop) tuples.
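Per-residue scores can be turned into (start,stop) region tuples along the following lines. This sketch assumes 1-based, inclusive positions and uses the > 0.5 default mentioned above. The function name is illustrative.

```python
def disordered_regions(scores, threshold=0.5):
    """Collapse per-residue disorder scores into (start, stop) tuples."""
    regions = []
    start = None
    for i, score in enumerate(scores, start=1):
        if score > threshold:
            if start is None:
                start = i          # a new disordered region opens here
        elif start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        # The sequence ended while still inside a disordered region.
        regions.append((start, len(scores)))
    return regions


regions = disordered_regions([0.1, 0.6, 0.7, 0.2, 0.9])
```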
Commandline:
### General Options ###
Module: rje_hmm
Description: HMMer Control Module
Version: 1.3
Last Edit: 25/11/08
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
This module is designed to perform basic HMM functions using the HMMer program. Currently, there are three functions
that may be performed, separately or consecutively:
* 1. Use hmmbuild to construct HMMs from input sequence files
* 2. Search a sequence database with HMM files
* 3. Convert HMMer output into a delimited text file of results.
Commandline:
## Build Options ##
Module: rje_menu
Description: Generic Menu Methods Module
Version: 0.2
Last Edit: 20/08/09
Copyright (C) 2006 Richard J. Edwards - See source code for GNU License Notice
Function:
This module is designed to contain generic menu methods for use with any RJE Object. At least, that's the plan...
Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module.
Uses general modules: os, string, sys
Uses RJE modules: rje
Other modules needed: None
Module: rje_motif
Description: Motif Class and Methods Module
Version: 3.0
Last Edit: 15/02/07
Copyright (C) 2006 Richard J. Edwards - See source code for GNU License Notice
Function:
This module contains the Motif class for use with both Slim Pickings and PRESTO, and associated methods. This basic
Motif class stores its pattern in several forms:
- info['Sequence'] stores the original pattern given to the Motif object
- list['PRESTO'] stores the pattern in a list of PRESTO format elements, where each element is a discrete part of
the motif pattern
- list['Variants'] stores simple strings of all the basic variants - length and ambiguity - for identifying the "best"
variant for any given match
- dict['Search'] stores the actual regular expression variants used for searching, which has a separate entry for
each length variant - otherwise Python RegExp gets confused! Keys for this dictionary relate to the number of
mismatches allowed in each variant.
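The idea of keeping a separate regular expression for each length variant can be sketched as below. The '.{m,n}' spacer syntax is plain Python regular expression notation chosen for the example and is not the PRESTO element format. The function name is hypothetical.

```python
import itertools
import re


def length_variants(pattern):
    """Expand each '.{m,n}' spacer into its fixed-length alternatives."""
    parts = re.split(r"(\.\{\d+,\d+\})", pattern)
    choices = []
    for part in parts:
        spacer = re.fullmatch(r"\.\{(\d+),(\d+)\}", part)
        if spacer:
            low, high = int(spacer.group(1)), int(spacer.group(2))
            choices.append(["." * n for n in range(low, high + 1)])
        else:
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]


variants = length_variants("R.{1,2}L")
compiled = [re.compile(v) for v in variants]  # one expression per length variant
```

Searching with each fixed-length expression in turn makes it unambiguous which variant produced a given hit.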
The Motif Class is designed for use with the MotifList class. When a motif is added to a MotifList object, the
Motif.format() command is called, which generates the 'PRESTO' list. After this - assuming it is to be kept -
Motif.makeVariants() makes the 'Variants' list. If creating a motif object in another module, these methods should be
called before any sequence searching is performed. If mismatches are being used, the Motif.misMatches() method must
also be called.
Commandline:
These options should be listed in the docstring of the module using the motif class:
-
Module: rje_motif_stats
Description: Motif Statistics Methods Module
Version: 1.0
Last Edit: 01/02/07
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module contains the Alignment Conservation methods for motifs, as well as other calculations needing occurrence
data. This module is designed to be used by the MotifList class, which contains the relevant commandline options.
Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module.
Uses general modules: copy, os, string, sys
Uses RJE modules: gopher_V2, rje, rje_blast, rje_disorder, rje_motif_V3, rje_seq, rje_sequence
Other modules needed: rje_seq modules
Module: rje_motiflist
Description: RJE Motif List Module
Version: 1.0
Last Edit: 03/04/07
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module contains the MotifList Class, which is designed to replace many of the functions that previously formed
part of the Presto Class. This class will then be used by PRESTO, SLiMPickings and CompariMotif (and others?) to
control Motif loading, redundancy and storage. MotifOcc objects will replace the previous PrestoSeqHit objects and
contain improved data commenting and retrieval methods. The MotifList class will contain methods for filtering motifs
according to individual or combined MotifOcc data.
The options below should be read in by the MotifList object when it is instanced with a cmd_list and therefore do not
need to be part of any class that makes use of this object unless it has conflicting settings.
The Motif Stats options are used by MotifList to calculate statistics for motif occurrences, though this data will
actually be stored in the MotifOcc objects themselves. This includes conservation statistics.
Note. Additional output parameters, such as motifaln and proteinaln settings, and stat filtering/novel scores are not
stored in this object, as they will be largely dependent on the main programs using the class, and the output from
those programs. (This also enables statfilters etc. to be used with stats not related to motifs and their occurrences
if desired.)
MotifList Commands:
## Basic Motif Input/Formatting Parameters ##
Module: rje_motifocc
Description: Motif Occurrence Module
Version: 0.0
Last Edit: 29/01/07
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module contains the MotifOcc class. This class is for storing methods and attributes pertinent to an individual
occurrence of a motif, i.e. one Motif instance in one sequence at one position. This class is loosely based on (and
should replace) the old PRESTO PrestoHit object. (And, to some extent, the PrestoSeqHit object.) This class is
designed to be flexible for use with PRESTO, SLiMPickings and CompariMotif, among others.
In addition to storing the standard info and stat dictionaries, this object will store a "Data" dictionary, which
contains the (program-specific) data to be output for a given motif. All data will be in string format. The
getData() and getStat() methods will automatically convert from string to numerics as needed.
Commandline:
This module has no standalone functionality and should not be called from the commandline.
Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje, rje_seq, rje_sequence
Other modules needed: None
Module: rje_pam
Description: Contains Objects for PAM matrices
Version: 1.2
Last Edit: 29/08/06
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
This module handles functions associated with PAM matrices. A PAM1 matrix is read from the given input file and
multiplied by itself to give PAM matrices corresponding to greater evolutionary distance. (PAM1 equates to one amino acid
substitution per 100aa of sequence.)
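Repeated multiplication of PAM1 can be illustrated with a toy two-state matrix standing in for the real 20x20 amino acid matrix. The values and function names are invented for the example.

```python
def matrix_multiply(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def pam_n(pam1, n):
    """Multiply PAM1 by itself to obtain the PAM-n matrix (n >= 1)."""
    result = pam1
    for _ in range(n - 1):
        result = matrix_multiply(result, pam1)
    return result


# Toy PAM1: each state has a 1% chance of substitution per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
pam2 = pam_n(toy_pam1, 2)
```

Each row of the result still sums to 1, as expected for a matrix of substitution probabilities.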
Commandline:
Module: rje_scoring
Description: Scoring and Ranking Methods for RJE Python Modules
Version: 0.0
Last Edit: 22/01/07
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module contains methods only for ranking, filtering and generating new scores from python dictionaries. At its
conception, this is for unifying and clarifying the new scoring and filtering options used by PRESTO & SLiMPickings,
though it is conceived that the methods will also be suitable for use in other/future programs.
The general format of expected data is a list of column headers, on which data may be filtered/ranked etc. or
combined to make new scores, and a dictionary containing the data for a given entry. The keys for the dictionary
should match the headers in a *case-insensitive* fashion. (The keys and headers will not be changed but will match
without using case, so do not have two case-sensitive variables, such as "A" and "a" unless they have the same
values.) !NB! For some methods, the case should have been matched.
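Case-insensitive matching of headers to dictionary keys might look like the sketch below, which also rejects keys that differ only by case but hold different values. The function name is illustrative and does not correspond to a method in this module.

```python
def match_headers(headers, data):
    """Map each header to the value stored under a case-insensitively equal key."""
    lowered = {}
    for key, value in data.items():
        low = key.lower()
        if low in lowered and lowered[low] != value:
            raise ValueError("keys differing only by case hold different values: %r" % key)
        lowered[low] = value
    # Headers keep their original case; only the lookup ignores case.
    return {h: lowered[h.lower()] for h in headers if h.lower() in lowered}


matched = match_headers(["Sig", "IC"], {"sig": "0.01", "ic": "2.5", "extra": "x"})
```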
Methods in this module will either return the input dictionary or list with additional elements (if calculating new
scores) or take a list of data dictionaries and return a ranked or filtered list.
Methods in this module:
* setupStatFilter(callobj,statlist,filterlist) = Makes StatFilter dictionary from statlist and filterlist
* statFilter(callobj,data,statfilter) = Filters data dictionary according to statfilter dictionary.
* setupCustomScores(callobj,statlist,scorelist,scoredict) = Checks and returns Custom Scores and related lists
Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent modules.
Uses general modules: copy, os, string, sys
Uses RJE modules: rje
Other modules needed: None
Module: rje_seq
Description: DNA/Protein sequence module
Version: 3.6
Last Edit: 10/06/09
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
Contains Classes and methods for sets of DNA and protein sequences.
Sequence Input/Output Commands:
Module: rje_sequence
Description: DNA/Protein sequence object
Version: 1.13
Last Edit: 26/01/10
Copyright (C) 2006 Richard J. Edwards - See source code for GNU License Notice
Function:
This module contains the Sequence Object used to store sequence data for all PEAT applications that use DNA or
protein sequences. It has no standalone functionality.
This module contains all the methods for parsing out sequence information, including species and source database,
based on the format of the input sequences. If using a consistent but custom format for fasta description lines,
please contact me and I can add it to the list of formats currently recognised.
Uses general modules: copy, os, random, re, sre_constants, string, sys, time
Uses RJE modules: rje, rje_disorder
Module: RJE_SLiM
Description: Short Linear Motif class module
Version: 1.3
Last Edit: 11/02/10
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module contains the new SLiM class, which replaces the old Motif class, for use with both SLiMFinder and
SLiMSearch. In addition, this module encodes some general motif methods. Note that the new methods are not
designed with Mass Spec data in mind and so some of the more complicated regexp designations for unknown amino acid
order etc. have been dropped. Because the SLiM class explicitly deals with *short* linear motifs, wildcard gaps are
capped at a max length of 9.
The basic SLiM class stores its pattern in several forms:
- info['Sequence'] stores the original pattern given to the Motif object
- info['Slim'] stores the pattern as a SLiMFinder-style string of defined elements and wildcard spacers
- dict['MM'] stores lists of Slim strings for each number of mismatches with flexible lengths enumerated. This is
used for actual searches in SLiMSearch.
- dict['Search'] stores the actual regular expression variants used for searching, which has a separate entry for
each length variant - otherwise Python RegExp gets confused! Keys for this dictionary relate to the number of
mismatches allowed in each variant and match dict['MM'].
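To illustrate why a separate entry per length variant is useful, the sketch below expands variable-length wildcard spacers into one fixed-length regular expression per combination of lengths. It is an assumption-laden example rather than the SLiM class code: the ".{m,n}" spacer notation, the function name length_variants and the MAX_GAP constant are chosen for illustration, and mismatches are not handled.

```python
import itertools
import re

MAX_GAP = 9  # wildcard spacer cap mentioned in the text

SPACER = re.compile(r"\.\{(\d+),(\d+)\}")


def length_variants(pattern):
    """Expand every `.{m,n}` spacer into one fixed-length regex per length combination."""
    pieces = SPACER.split(pattern)  # [text, m, n, text, m, n, ..., text]
    fixed = pieces[0::3]
    ranges = []
    for low, high in zip(pieces[1::3], pieces[2::3]):
        low, high = min(int(low), MAX_GAP), min(int(high), MAX_GAP)
        ranges.append(range(low, high + 1))
    variants = []
    for gaps in itertools.product(*ranges):
        variant = fixed[0]
        for gap, text in zip(gaps, fixed[1:]):
            variant += "." * gap + text
        variants.append(variant)
    return variants


print(length_variants("R.{0,2}L"))
```

Each returned string can then be compiled and searched on its own, so every match is tied to exactly one length variant.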
The following were previously used by the Motif class and may be revived for the new SLiM class if needed:
- list['Variants'] stores simple strings of all the basic variants - length and ambiguity - for identifying the
"best" variant for any given match
The SLiM class is designed for use with the SLiMList class. When a SLiM is added to a SLiMList object, the
SLiM.format() command is called, which generates the 'Slim' string. After this - assuming it is to be kept -
SLiM.makeVariants() makes the 'Variants' list. If creating a motif object in another module, these methods should be
called before any sequence searching is performed. If mismatches are being used, the SLiM.misMatches() method must
also be called.
SLiM occurrences are stored in the dict['Occ'] attribute. The keys for this are Sequence objects and values are
either a simple list of positions (1 to L) or a dictionary of attributes with positions as keys.
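The simpler of the two occurrence layouts (objects as keys, lists of positions counted from 1 as values) could be built along the following lines. The Sequence class here is a hypothetical stand-in, and find_occurrences is an illustrative name rather than a method of the SLiM class.

```python
import re


class Sequence(object):
    """Hypothetical stand-in for a sequence object; hashable by identity."""
    def __init__(self, name, residues):
        self.name = name
        self.residues = residues


def find_occurrences(regex, sequences):
    """Return {sequence object: [1-based start positions]} for sequences with hits."""
    occ = {}
    compiled = re.compile(regex)
    for seq in sequences:
        positions = []
        start = 0
        while start < len(seq.residues):
            match = compiled.search(seq.residues, start)
            if match is None:
                break
            positions.append(match.start() + 1)  # convert 0-based index to 1..L
            start = match.start() + 1            # step by one so overlapping hits are found
        if positions:
            occ[seq] = positions
    return occ


seq_a = Sequence("a", "RLRLAARL")
seq_b = Sequence("b", "AAAA")
occ = find_occurrences("RL", [seq_a, seq_b])
print(occ[seq_a])
```

Sequences without a match are simply absent from the dictionary in this sketch; whether the real class stores empty lists instead is not stated in the text.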
Commandline:
These options should be listed in the docstring of the module using the motif class:
-
Module: rje_slimcalc
Description: SLiM Attribute Calculation Module
Version: 0.3
Last Edit: 13/08/08
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module is based on the old rje_motifstats module. It is primarily for calculating empirical attributes of SLiMs
and their occurrences, such as Conservation, Hydropathy and Disorder.
Commandline:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Motif Occurrence Attribute Options ###
Module: rje_slimcore
Description: Core module/object for SLiMFinder and SLiMSearch
Version: 1.6
Last Edit: 11/02/10
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module is primarily to contain core dataset processing methods for both SLiMFinder and SLiMSearch to inherit and
use. This primarily consists of the options and methods for masking datasets and generating UPC. This module can
therefore be run in standalone mode to generate UPC files for SLiMFinder or SLiMSearch.
In addition, the secondary MotifSeq and Randomise functions are handled here.
Secondary Functions:
The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.
The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets
by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final
datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.
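One way to picture the shuffling is the sketch below, which pools the UPC from all datasets, shuffles the pool and deals it back out. This is only a guess at the general idea and not the procedure implemented in rje_slimcore: the function name randomise_datasets, the representation of a UPC as a frozenset of sequence IDs and the seed argument are all assumptions, and relationships between UPC clusters are not modelled.

```python
import random


def randomise_datasets(datasets, seed=None):
    """Shuffle UPC among datasets, keeping each dataset's original UPC count where possible.

    `datasets` maps dataset name -> list of UPC, each UPC a frozenset of sequence IDs.
    """
    rng = random.Random(seed)
    pool = [upc for name in sorted(datasets) for upc in datasets[name]]
    rng.shuffle(pool)
    shuffled = {}
    index = 0
    for name in sorted(datasets):
        size = len(datasets[name])
        drawn = pool[index:index + size]
        index += size
        # A UPC drawn twice collapses to one entry, so a new dataset can end up smaller.
        unique = []
        for upc in drawn:
            if upc not in unique:
                unique.append(upc)
        shuffled[name] = unique
    return shuffled


datasets = {"d1": [frozenset(["a"]), frozenset(["b"])], "d2": [frozenset(["c"])]}
print(randomise_datasets(datasets, seed=1))
```

The deduplication step shows in miniature why the regenerated datasets may contain fewer UPC than the originals when the same sequences occur in several input datasets.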
Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
Module: rje_slimlist
Description: SLiM dataset manager
Version: 0.4
Last Edit: 06/01/10
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module is a replacement for the rje_motiflist module and contains the SLiMList class, a replacement for the
MotifList class. The primary function of this class is to load and store a list of SLiMs and control generic SLiM
outputs for such programs as SLiMSearch. This class also controls motif filtering according to features of the motifs
and/or their occurrences.
Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
Module: rje_tm
Description: Transmembrane and Signal Peptide Prediction Module
Version: 1.2
Last Edit: 16/08/07
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
Will read in results from tmhmm and/or signalp files as appropriate and append output to:
- tm.tdt = TM domain counts and orientation
- domains.tdt = Domain table
- singalp.tdt = SignalP data (use to add signal peptide domains to domains table using mySQL)
Commandline:
Module: rje_tree
Description: Phylogenetic Tree Module
Version: 2.6
Last Edit: 11/12/09
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
Reads in, edits and outputs phylogenetic trees. Executes duplication and subfamily determination. More details available
in documentation for HAQESAC, GASP and BADASP at http://www.bioinformatics.rcsi.ie/~redwards/
General Commands:
Module: rje_tree_group
Description: Contains all the Grouping Methods for rje_tree.py
Version: 1.2
Last Edit: 19/08/09
Copyright (C) 2005 Richard J. Edwards - See source code for GNU License Notice
Function:
This module is a stripped down template for methods only. This is for when a class has too many methods and becomes
untidy. In this case, methods can be moved into a methods module and 'self' replaced with the relevant object. For
this module, 'self' becomes '_tree'.
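The pattern described above, in which a method is moved out of a class and 'self' is renamed, can be sketched as follows. The Tree class, its groups attribute and the names count_groups and countGroups are hypothetical and greatly simplified; they are not taken from rje_tree.py.

```python
# Contents of a hypothetical methods module (e.g. a file of grouping methods).
def count_groups(_tree):
    """What would be a method using 'self' takes the object as '_tree' instead."""
    return len(_tree.groups)


class Tree(object):
    """Hypothetical, heavily simplified tree class."""
    def __init__(self, groups):
        self.groups = list(groups)

    def countGroups(self):
        # The class keeps a thin wrapper and passes itself to the module-level function.
        return count_groups(self)


tree = Tree(["subfamA", "subfamB"])
print(tree.countGroups())
```

The class stays short because the body of each method lives in the methods module, while callers still use the object in the usual way.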
Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module: rje_tree.py
Uses general modules: copy, re, os, string, sys
Uses RJE modules: rje, rje_seq
Other modules needed: rje_blast, rje_dismatrix, rje_pam, rje_sequence, rje_uniprot
Module: rje_uniprot
Description: RJE Module to Handle Uniprot Files
Version: 3.7
Last Edit: 26/01/10
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module contains methods for handling UniProt files, primarily in other rje modules but also with some
standalone functionality. To get the most out of the module with big UniProt files (such as the downloads from EBI),
first index the UniProt data using the rje_dbase module.
This module can be used to extract a list of UniProt entries from a larger database and/or to produce summary tables
from UniProt flat files.
In addition to methods associated with the classes of this module, there are a number of methods that are called from
the rje_dbase module (primarily) to download and process the UniProt sequence database.
Input Options:
Module: RJE_XGMML
Description: RJE XGMML Module
Version: 0.0
Last Edit: 14/11/07
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
This module is currently designed to store data for, and then output, an XGMML file for uploading into Cytoscape etc.
Future versions may incorporate the ability to read and manipulate existing XGMML files.
Commandline:
At present, all commands are handled by the class populating the XGMML object.
Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje
Other modules needed: None
Module: rje_zen
Description: Random Zen Wisdom Generator
Version: 1.0
Last Edit: 15/04/08
Copyright (C) 2007 Richard J. Edwards - See source code for GNU License Notice
Function:
Generates random (probably nonsensical) Zen wisdoms. Just for fun.
Commandline: