ReadMe documentation for release of SLIMSUITE software

Distribution compiled: Mon Apr 16 14:10:02 2012

Questions/Comments?: please contact seqsuite@gmail.com

Python Modules in tools/:

Python Modules in extras/:

Python Modules in libraries/:

Manuals

Other Files


Installation Instructions

  1. Place the slimsuite.tar.gz file in a chosen directory (e.g. c:\bioware\) and unpack.
  2. A subdirectory slimsuite will be created containing all the files necessary to run.

The software should run on any system that has Python installed. Additional software may be necessary for full functionality. Further details can be found in the manuals supplied.


GNU License

Copyright (C) 2011 RJ Edwards <seqsuite@gmail.com>

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA

Author contact: <seqsuite@gmail.com> / Centre for Biological Sciences, University of Southampton, UK.

To incorporate this module into your own programs, please see GNU Lesser General Public License disclaimer in rje.py

tools/:

CompariMotif [version 3.5] Motif vs Motif Comparison Software ~ (comparimotif_V3.py)

Program: CompariMotif
Description: Motif vs Motif Comparison Software
Version: 3.5
Last Edit: 04/08/09
Imports: rje, rje_menu, rje_seq, rje_slim, rje_slimlist, rje_xgmml, rje_zen
Imported By: qslimfinder, slimfinder, rje_slimcore, slimfrap
Citation: Edwards, Davey & Shields (2008), Bioinformatics 24(10):1307-9.
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
CompariMotif is a piece of software with a single objective: to take two lists of protein motifs and compare them to
each other, identifying which motifs have some degree of overlap, and identifying the relationships between those
motifs. It can be used to compare a list of motifs with themselves, their reversed selves, or a list of previously
published motifs (e.g. ELM (http://elm.eu.org/)). CompariMotif outputs a table of all pairs of matching
motifs, along with their degree of similarity (information content) and their relationship to each other.

The best match is used to define the relationship between the two motifs. These relationships are composed of the
following keywords:

Match type keywords identify the type of relationship seen:
* Exact = all the matches in the two motifs are precise
* Variant = the focal motif contains only exact matches and subvariants of degenerate positions in the other motif
* Degenerate = the focal motif contains only exact matches and degenerate versions of positions in the other motif
* Complex = some positions in the focal motif are degenerate versions of positions in the compared motif, while
others are subvariants of degenerate positions.

Match length keywords identify the length relationships of the two motifs:
* Match = both motifs are the same length and match across their entire length
* Parent = the focal motif is longer and entirely contains the compared motif
* Subsequence = the focal motif is shorter and entirely contained within the compared motif
* Overlap = neither motif is entirely contained within the other

This gives sixteen possible classifications for each motif's relationship to the compared motif.
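
As a rough illustration (not taken from the CompariMotif source), the sixteen classifications are every pairing of a
match type keyword with a match length keyword. The variable names and the "Type Length" label format below are
assumptions made for this sketch.

```python
from itertools import product

# Keywords as listed above; the variable names are illustrative only.
MATCH_TYPES = ["Exact", "Variant", "Degenerate", "Complex"]
MATCH_LENGTHS = ["Match", "Parent", "Subsequence", "Overlap"]

# Every (type, length) pairing gives one possible classification.
# Joining the two keywords with a space is an assumed label format.
CLASSIFICATIONS = [
    "{} {}".format(mtype, mlen)
    for mtype, mlen in product(MATCH_TYPES, MATCH_LENGTHS)
]

print(len(CLASSIFICATIONS))  # 16
```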

Input:
CompariMotif can take input in a number of formats. The preferred format is SLiMSearch format, which is a single-line
motif format: 'Name Sequence #Comments' (Comments are optional and ignored). Alternative inputs include SLiMDisc and
Slim Pickings output, raw lists of motifs, and fasta format.
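
A minimal sketch of reading the single-line 'Name Sequence #Comments' format is given below. It is not the parser
used by CompariMotif: it assumes that fields are whitespace-separated and that the comment starts at the first '#'.
The function name parse_motif_line and the example motif are hypothetical.

```python
def parse_motif_line(line):
    """Split a 'Name Sequence #Comments' line into (name, pattern).

    Returns None for blank or comment-only lines. Assumes whitespace-separated
    fields and that everything after the first '#' is a comment.
    """
    body = line.split("#", 1)[0].strip()  # drop the optional comment
    if not body:
        return None
    fields = body.split()
    if len(fields) < 2:
        raise ValueError("Expected 'Name Sequence', got: {!r}".format(line))
    return fields[0], fields[1]


print(parse_motif_line("demo_motif R.L..L # comment is ignored"))
```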

Output:
The main output for CompariMotif is a delimited text file containing the following fields:
* File1 = Name of motifs file (if outstyle=multi)
* File2 = Name of searchdb file (if outstyle=multi)
* Name1 = Name of motif from motif file 1
* Name2 = Name of motif from motif file 2
* Motif1 = Motif (pattern) from motif file 1
* Motif2 = Motif (pattern) from motif file 2
* Sim1 = Description of motif1's relationship to motif2
* Sim2 = Description of motif2's relationship to motif1
* Match = Text summary of matched region
* MatchPos = Number of matched positions between motif1 and motif2 (>= minshare=X)
* MatchIC = Information content of matched positions
* NormIC = MatchIC as a proportion of the maximum possible MatchIC
* Score = Heuristic score (MatchPos x NormIC) for ranking motif matches
* Info1 = Ambiguity score of motif1
* Info2 = Ambiguity score of motif2
* Desc1 = Description of motif1 (if motdesc = 1 or 3)
* Desc2 = Description of motif2 (if motdesc = 2 or 3)

With the exception of the file names, which are only output if outstyle=multi, the above is the output for the
default "normal" output style. If outstyle=single then only statistics for motif2 (the searchdb motif) are output
as this is designed for searches using a single motif against a motif database. If outstyle=normalsplit or
outstyle=multisplit then motif1 information is grouped together, followed by motif2 information, followed by the
match statistics. More information can be found in the CompariMotif manual.
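
The relationship between the MatchPos, MatchIC, NormIC and Score fields can be sketched as follows. This is an
illustration only: how CompariMotif derives the maximum possible information content for a pair of motifs is not
described here, so it is passed in as a plain number (max_ic, a hypothetical parameter). Written for Python 3.

```python
def norm_ic(match_ic, max_ic):
    """Information content of the match as a proportion of the maximum possible."""
    if max_ic <= 0:
        raise ValueError("max_ic must be positive")
    return match_ic / max_ic


def heuristic_score(match_pos, match_ic, max_ic):
    """Score = MatchPos x NormIC, as described in the field list above."""
    return match_pos * norm_ic(match_ic, max_ic)


# Three matched positions with MatchIC 2.5 out of a possible 3.0.
score = heuristic_score(3, 2.5, 3.0)
keep = norm_ic(2.5, 3.0) >= 0.5  # compare against the normcut=X default of 0.5
```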

Commandline:
## Basic Input Parameters ##
* motifs=FILE : File of input motifs/peptides [None]
* searchdb=FILE : (Optional) second motif file to compare. Will compare to self if none given. [None]
* dna=T/F : Whether motifs should be considered as DNA motifs [False]

## Basic Output Parameters ##
* resfile=FILE : Name of results file, FILE.compare.txt. [motifsFILE-searchdbFILE.compare.txt]
* motinfo=FILE : Filename for output of motif summary table (if desired) [None]
* motific=T/F : Output Information Content for motifs [False]

## Motif Comparison Parameters ##
* minshare=X : Min. number of non-wildcard positions for motifs to share [2]
* normcut=X : Min. normalised MatchIC for motif match [0.5]
* matchfix=X : If >0 must exactly match *all* fixed positions in the motifs from: [0]
- 1: input (motifs=FILE) motifs
- 2: searchdb motifs
- 3: *both* input and searchdb motifs
* ambcut=X : Max number of choices in ambiguous position before replaced with wildcard (0=use all) [10]

## Advanced Motif Input Parameters ##
* minic=X : Min information content for a motif (1 fixed position = 1.0) [2.0]
* minfix=X : Min number of fixed positions for a motif to contain [0]
* minpep=X : Min number of defined positions in a motif [2]
* trimx=T/F : Trims Xs from the ends of a motif [False]
* nrmotif=T/F : Whether to remove redundancy in input motifs [False]
* reverse=T/F : Reverse the input motifs. [False]
- If no searchdb given, these will be searched against the "forward" motifs.
* mismatches=X : <= X mismatches of positions can be tolerated [0]
* aafreq=FILE : Use FILE to replace uniform AAFreqs (FILE can be sequences or aafreq) [None]

## Advanced Motif Output Parameters ##
* xgmml=T/F : Whether to output XGMML format results [True]
* pickle=T/F : Whether to load/save pickle following motif loading/filtering [False]
* motdesc=X : Sets which motifs have description outputs (0-3 as matchfix option) [3]
* outstyle=X : Sets the output style for the resfile [normal]
- normal = all standard stats are output
- multi = designed for multiple appended runs. File names are also output
- single = designed for searches of a single motif vs a database. Only motif2 stats are output
- reduced = motifs do not have names or descriptions
- normalsplit/multisplit = as normal/multi but stats are grouped by motif rather than by type
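
The options above are all of the form name=value. The sketch below assembles such an argument list in Python 3.7+;
the script path, the input file names and the build_command helper are assumptions for illustration, and the
resulting list is only printed rather than executed.

```python
import sys


def build_command(script, **options):
    """Return [python, script, 'key=value', ...] for the given options."""
    args = [sys.executable, script]
    for key, value in options.items():
        if isinstance(value, bool):
            value = "T" if value else "F"  # booleans are written as T/F
        args.append("{}={}".format(key, value))
    return args


cmd = build_command(
    "comparimotif_V3.py",       # path to the script is an assumption
    motifs="my_motifs.txt",     # hypothetical input file
    searchdb="elm_motifs.txt",  # hypothetical comparison file
    minshare=2,
    normcut=0.5,
    xgmml=False,
)
print(" ".join(cmd[1:]))
```

If desired, the list could be handed to subprocess.run(cmd) to launch the run.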

Uses general modules: os, string, sys, time
Uses RJE modules: rje, rje_menu, rje_slim, rje_slimlist, rje_xgmml, rje_zen
Other modules needed: rje_aaprop, rje_dismatrix, rje_seq, rje_blast, rje_pam, rje_sequence, rje_uniprot

comparimotif_V3 Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Full working version with menu
    # 1.1 - Added extra output options
    # 2.0 - Reworked for functionality with MotifList instead of PRESTO and using own methods.
    # 2.1 - Minor bug fixing and tidying. Removed matchic=F option. Added score and normcut=X.
    # 3.0 - Replaced rje_motif* modules with rje_slim* modules and improved handling of termini.
    # 3.1 - Added XGMML output.
    # 3.2 - Added mismatches=X option. (NB. mismatch=X is used in SLiMSearch.)
    # 3.3 - Added "Match" column, summarising matches
    # 3.4 - Added a DNA option and AA frequencies.

CompariMotif Class

    CompariMotif Motif-Motif Comparison Class. Author: Rich Edwards (2005).

    Info:str
    - AAFreq = Use FILE to replace uniform AAFreqs (FILE can be sequences or aafreq) [None]    
    - MotInfo = Filename for output of motif summary table (if desired) [None]
    - OutStyle = Output Style for comparison results [normal]
    - SearchDB = Second file of motifs to compare []
        
    Opt:boolean
    - DNA = Whether motifs should be considered as DNA motifs [False]
    - MotifIC = Output Information Content for motifs [False]
    - Pickle = Whether to load/save pickle following motif loading/filtering [False]
    - XGMML = Whether to output XGMML format results [True]

    Stat:numeric
    - NormCut = Min. normalised MatchIC for motif match [0.5]
    - MatchFix = If >0 must match *all* fixed positions in the motifs from:  [0]
        - 1: input (motifs=FILE) motifs
        - 2: searchdb motifs
        - 3: *both* input and searchdb motifs
    - MinShare = Min. number of non-wildcard positions for motifs to share [2]
    - Mismatches = <= X mismatches of positions can be tolerated [0]
    - MotDesc = Sets which motifs have description outputs (0-3 as matchfix option) [3]

    List:list
    - Matches = List of match dictionaries used for Cytoscape output

    Dict:dictionary
    - AAFreq = Dictionary of AA Frequencies (uniform by default)

    Obj:RJE_Objects
    - SlimList = SLiMList object, which controls most command-line parameters!
    - SearchDB = SLiMList object containing compared dictionary
CompariMotif._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
CompariMotif._interpretMatch(self,vmatch)
    Translates vmatch stats into match information for output.
    >> vmatch:tuple = current best match: (ic,assessment1,assessment2,overlap tuple)
    << match:list = [matchtype1,matchtype2,positions matched,information content]
CompariMotif._positionRelationship(self,c1,c2)
    Compares motif elements c1 and c2 and returns their relationship. Each should be a fixed position, an ambiguous
    position, or a wildcard 'X'.
    >> c1:str = Element of variant of Motif1 in motif-motif comparison
    >> c2:str = Element of variant of Motif2 in motif-motif comparison
    << rel:tuple = relationship of (c1 vs c2, c2 vs c1)
CompariMotif._setAttributes(self)
    Sets Attributes of Object.
CompariMotif._updateVMatch(self,vmatch,ic,assessment1,assessment2,overlap)
    Assesses and updates current best variant match stats.
    >> vmatch:tuple = current best match: (ic,assessment1,assessment2)
    >> ic:float = Information Content of current match
    >> assessment1:dict = counts of different match types for motif1
    >> assessment2:dict = counts of different match types for motif2
    << newmatch:tuple = updated best match: (ic,assessment1,assessment2)
CompariMotif.alphabet(self)


CompariMotif.ambVariant(self,el1,el2)
    Returns whether el2 is an ambiguous variant of el1.
CompariMotif.compareXGMML(self,save=True)
    Outputs XGMML format file for Cytoscape.
    >> save:bool = whether to save file (True) or return XGMML object (False) [True]
CompariMotif.motifCompare(self)
    Compares two lists of motifs against each other
CompariMotif.run(self)
    Main run method. Sets up parameters and calls comparison method. Returns XGMML.
CompariMotif.setupRun(self,postmenu=False)
    Sets up runtime parameters.
CompariMotif.speedskip(self,seq1,seq2)
    Speedily checks that seq1 and seq2 *may* be able to meet match requirements.
    << returns True if they should be skipped, else False.

comparimotif_V3 Module Methods

comparimotif_V3.runMain()




comparimotif_V3 Module ToDo Wishlist

    [Y] : Update the docstring to account for new module use.
    [ ] : Add weighted IC and use of AAfreq files

GOPHER [version 2.8] Generation of Orthologous Proteins from High-Throughput Estimation of Relationships ~ (gopher_V2.py)

Program: GOPHER
Description: Generation of Orthologous Proteins from High-Throughput Estimation of Relationships
Version: 2.8
Last Edit: 12/02/10
Imports: rje, rje_blast, rje_seq, rje_tree, rje_dismatrix_V2
Imported By: presto_V5, rje_yeast, rje_iridis, rje_motif_stats, rje_slimcalc
Citation: Davey, Edwards & Shields (2007), Nucleic Acids Res. 35(Web Server issue):W455-9.
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
This script is designed to take in two sequence files and generate datasets of orthologous sequence alignments.
The first [seqin] sequence set is the 'queries' around which orthologous datasets are to be assembled. This is now
optimised for a dataset consisting of one protein per protein-coding gene, although splice variants should be dealt
with OK and treated as paralogues. This will only cause problems if the postdup=T option is used, which restricts
orthologues returned to be within the last post-duplication clade for the sequence.

The second [orthdb] is the list of proteins from which the orthologues will be extracted. The seqin sequences are
then BLASTed against the orthdb and processed (see below) to retain putative orthologues using an estimation of the
phylogenetic relationships based on pairwise sequence similarities.

NB. As of version 2.0, gopher=FILE has been replaced with seqin=FILE for greater rje python consistency. The allqry
option has been removed. Please cleanup the input data into a desired non-redundant dataset before running GOPHER.
(In many ways, GOPHER's strength is its capacity to be run for a single sequence of interest rather than a whole
genome, and it is this functionality that has been concentrated on for use with PRESTO and SLiM Pickings etc.) The
output of statistics for each GOPHER run has also been discontinued for now but may be reintroduced with future
versions. The phosalign command (to produce a table of potential phosphorylation sites (e.g. S,T,Y) across
orthologues for special conservation of phosphorylation prediction analyses) has also been discontinued for now.

Version 2.1 has tightened up on the use of rje_seq parameters that were causing trouble otherwise. It is now the
responsibility of the user to make sure that the orthologue database meets the desired criteria. Duplicate accession
numbers will not be tolerated by GOPHER and (arbitrary) duplicates will be deleted if the sequences are the same, or
renamed otherwise. Renaming may cause problems later. It is highly desirable not to have two proteins with the same
accession number but different amino acid sequences. The following commands are added to the rje_seq object when input
is read: accnr=T unkspec=F specnr=F gnspacc=T. Note that unknown species are also not permitted.
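
The duplicate-accession rule described above (delete identical duplicates, rename the rest) can be sketched as
follows. This is not GOPHER's own code: the function name resolve_duplicates and the "_dupN" renaming scheme are
hypothetical, and the renaming actually used by GOPHER may differ.

```python
def resolve_duplicates(records):
    """Make accession numbers unique in a list of (accession, sequence) pairs.

    Identical duplicates are dropped; a duplicate accession with a different
    sequence is renamed using a hypothetical '_dupN' suffix.
    """
    seen = {}  # accession -> set of sequences already kept under it
    kept = []
    for acc, seq in records:
        if acc not in seen:
            seen[acc] = {seq}
            kept.append((acc, seq))
        elif seq in seen[acc]:
            continue  # same accession, same sequence: delete
        else:
            seen[acc].add(seq)
            new_acc = "{}_dup{}".format(acc, len(seen[acc]) - 1)
            kept.append((new_acc, seq))
    return kept


records = [("P1", "AAA"), ("P1", "AAA"), ("P1", "AAC"), ("P2", "GGG")]
print(resolve_duplicates(records))
```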

The process for dataset assembly is as follows for each protein:

1. BLAST against orthdb [orthblast]
> BLASTs saved in BLAST/AccNum.blast
2. Work through BLAST hits, identifying paralogues (query species duplicates) and the closest homologue from each
other species. This involves a second BLAST of the query versus original BLAST hits (e-value=10, no complexity
filter). The best sequence from each species is kept, i.e. the one with the best similarity to the query and not part
of a clade with any paralogue that excludes the query. (If postdup=T, the hit must be in the query's post duplication
clade.) In addition hits: [orthfas]
* Must have minimum identity level with Query
* Must be one of the 'good species' [goodspec=LIST]
> Save reduced sequences as ORTH/AccNum.orth.fas
> Save paralogues identified (and meeting minsim settings) in PARA/AccNum.para.fas
3. Align sequences with MUSCLE [orthalign]
> ALN/AccNum.orthaln.fas
4. Generate an unrooted tree (with ClustalW or PHYLIP) [orthtree]
> TREE/AccNum.orth.nsf
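
The per-stage output locations listed above can be collected in a small lookup, which may be handy when checking
which files a run has produced. The sketch below only builds path strings; the names STAGE_OUTPUTS and stage_output
are hypothetical and the accession is made up.

```python
import os

# Sub-directory and file suffix for each stage, as listed in steps 1-4 above.
STAGE_OUTPUTS = {
    "orthblast": ("BLAST", ".blast"),
    "orthfas": ("ORTH", ".orth.fas"),
    "orthalign": ("ALN", ".orthaln.fas"),
    "orthtree": ("TREE", ".orth.nsf"),
}


def stage_output(accnum, stage, runpath="./"):
    """Return the expected output path for one accession at one stage."""
    subdir, suffix = STAGE_OUTPUTS[stage]
    return os.path.join(runpath, subdir, accnum + suffix)


for stage in ("orthblast", "orthfas", "orthalign", "orthtree"):
    print(stage_output("ACC0001", stage))  # ACC0001 is an illustrative accession
```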

Optional paralogue/subfamily output: (These are best not used with Force=T or FullForce=T)
2a. Alignment of query protein and any paralogues >minsim threshold (paralign=T/F). The parasplice=T/F controls
whether splice variants are in these paralogue alignments (where identified using AccNum-X notation).
> PARALN/AccNum.paraln.fas
2b. Pairwise combinations of paralogues and their orthologues aligned, with "common" orthologues removed from the
dataset, with a rooted tree and group data for BADASP analysis etc. (parafam=T)
> PARAFAM/AccNum+ParaAccNum.parafam.fas
> PARAFAM/AccNum+ParaAccNum.parafam.nsf
> PARAFAM/AccNum+ParaAccNum.parafam.grp
2c. Combined protein families consisting of a protein, all the paralogues > minsim and all orthologues for each in a
single dataset. Unaligned. (gopherfam=T)
> SUBFAM/AccNum.subfam.fas
*NB.* The subfamily outputs involve Gopher calling itself to ensure the paralogues have gone through the Gopher
process themselves. This could potentially cause conflict if forking is used.

Commandline:
### Basic Input/Output ###
* seqin=FILE : Fasta file of 'query' sequences for orthology discovery []
* orthdb=FILE : Fasta file with pool of sequences for orthology discovery []. Should contain query sequences.
* startfrom=X : Accession Number / ID to start from. (Enables restart after crash.) [None]
* dna=T/F : Whether to analyse DNA sequences (not optimised) [False]

### GOPHER run control parameters ###
orthblast : Run to blasting versus orthdb (Stage 1).
orthfas : Run to output of orthologues (Stage 2).
orthalign : Run to alignment of orthologues (Stage 3).
orthtree : Run to tree-generation (Stage 4). [default!]
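
Because each run mode means "run to" a given stage, selecting a mode implies all earlier stages. A minimal sketch of
that ordering is shown below; the helper name stages_to_run is hypothetical and this is not how GOPHER itself
dispatches its stages.

```python
# Stage order as given above (Stage 1 to Stage 4).
STAGES = ["orthblast", "orthfas", "orthalign", "orthtree"]


def stages_to_run(mode="orthtree"):
    """Return every stage up to and including the requested run mode."""
    if mode not in STAGES:
        raise ValueError("Unknown run mode: {}".format(mode))
    return STAGES[: STAGES.index(mode) + 1]


print(stages_to_run("orthalign"))
```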

### GOPHER Orthologue identification Parameters ###
* postdup=T/F : Whether to align only post-duplication sequences [False]
* minsim=X : Minimum %similarity of Query for each "orthologue" [40.0]
* simfocus=X : Style of similarity comparison used for MinSim and "Best" sequence identification [query]
- query = %query must > minsim (Best if query is ultimate focus and maximises closeness of returned orthologues)
- hit = %hit must > minsim (Best if lots of sequence fragments are in searchdb and should be retained)
- either = %query > minsim OR %hit > minsim (Best if both above conditions are true)
- both = %query > minsim AND %hit > minsim (Most stringent setting)
* gablamo=X : GABLAMO measure to use for similarity measures [Sim]
- ID = %Identity (from BLAST)
- Sim = %Similarity (from BLAST)
- Len = %Coverage (from BLAST)
* goodX=LIST : Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms, at least one of which must be in the description line
* badX=LIST : As goodX but excludes rather than retains filtered sequences
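
The four simfocus settings amount to different combinations of the same two comparisons. The sketch below restates
them as a function; passes_minsim, pct_query and pct_hit are hypothetical names, and the strict ">" comparison simply
follows the wording of the option descriptions.

```python
def passes_minsim(simfocus, pct_query, pct_hit, minsim=40.0):
    """Apply the simfocus rule to the query and hit similarity percentages."""
    query_ok = pct_query > minsim
    hit_ok = pct_hit > minsim
    rules = {
        "query": query_ok,
        "hit": hit_ok,
        "either": query_ok or hit_ok,
        "both": query_ok and hit_ok,
    }
    if simfocus not in rules:
        raise ValueError("Unknown simfocus: {}".format(simfocus))
    return rules[simfocus]


# A short hit fragment: well matched over the hit, poorly covering the query.
for focus in ("query", "hit", "either", "both"):
    print(focus, passes_minsim(focus, pct_query=30.0, pct_hit=85.0))
```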

### Additional run control options ###
* repair=T/F : Repair mode - replace previous files if date mismatches or files missing.
(Skip missing files if False) [True]
* force=T/F : Whether to force execution at current level even if results are new enough [False]
* fullforce=T/F : Whether to force current and previous execution even if results are new enough [False]
* dropout=T/F : Whether to "drop out" at earlier phases, or continue with single sequence [False]
* ignoredate=T/F : Ignores the age of files and only replaces if missing [False]
* savespace=T/F : Save space by deleting intermediate blast files during orthfas [True]
* maxpara=X : Maximum number of paralogues to consider (large gene families can cause problems) [50]

### Additional Output Options ###
* runpath=PATH : Specify parent directory in which to output files [./]
* paralign=T/F : Whether to produce paralogue alignments (>minsim) in PARALN/ (assuming run to orthfas+) [False]
* parasplice=T/F : Whether splice variants (where identified) are counted as paralogues [False]
* parafam=T/F : Whether to produce paralogue paired subfamily alignments (>minsim) (assuming run to orthfas+) [False]
* gopherfam=T/F : Whether to combine paralogous gopher orthologues into protein families (>minsim) (assuming run to orthfas+) [False]
* sticky=T/F : Switch on "Sticky Orthologous Group generation" [False]
* stiggid=X : Base for Stigg ID numbers [STIGG]

Uses general modules: copy, gc, glob, os, string, sys, threading, time
Uses RJE modules: rje, rje_blast, rje_dismatrix, rje_seq, rje_tree
Other modules needed: rje_ancseq, rje_pam, rje_sequence, rje_tree_group, rje_uniprot

gopher_V2 Module Version History

    # 0.0 - Initial Working Compilation.
    # 1.0 - Preliminary working version up to alignment stage
    # 2.0 - Complete reworking of main _orthFas() method. See archived GOPHER 1.9 for history and obsolete options.
    # 2.1 - Updated the parafam etc. methods to work with the new ensloci data
    # 2.2 - Added dna=T option.
    # 2.3 - Added soaplab=T settings.
    # 2.4 - Tidied up a bit.
    # 2.5 - Added soaplab file cleanup.
    # 2.6 - Replaced Query sequence with input to avoid shared ID problems.
    # 2.7 - Added sticky Orthologue method and maxpara=X.
    # 2.8 - Made dropout an option.

Gopher Class

    Gopher main controller class. Author: Rich Edwards (2005). This class contains a lot of forking options for use on
    certain machines with multiple processors.    

    Info:str
    - Name = Name of Input sequence file (gopher=FILE)
    - OrthDB = Name of sequence database to find orthologues in (orthdb=FILE)
    - StartFrom = AccNum/ID to start running from
    - StiggID = Base for Stigg ID numbers [STIGG]
    - Gopher = Alternative input file name - use only if seqin in ['','None']
    
    Opt:boolean
    - DNA = Whether to analyse DNA sequences (not optimised) [False]
    - DropOut = Whether to "drop out" at earlier phases, or continue with single sequence [False]
    - IgnoreDate = Ignores the age of files and only replaces if missing [False]
    - Sticky = Switch on "Sticky Orthologous Group generation" [False]

    Stat:numeric

    List:list

    Dict:dictionary    

    Obj:RJE_Objects
    - SeqIn = SeqList object for handling input sequences
    - BLAST = BlastRun object for handling BLAST searches
Gopher.OLDstiGG(self,gopherseq,gopherblast)
    Generate and Process StGG dictionary from orthology files.
Gopher._activeForks(self,pidlist=[])
    Checks Process IDs of list and returns list of those still running.
    >> pidlist:list of integers = Process IDs
Gopher._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Gopher._logTransfer(self,accnum='')
    Transfers details from accnum.log to self.log.
    accnum:str = leader for accnum.log
Gopher._setAttributes(self)
    Sets Attributes of Object.
Gopher.getOrth(self,stigg,gopherseq,gopherblast,accdict)


Gopher.getSticky(self,stigg,gopherseq,gopherblast,accdict)


Gopher.gopherRejects(self)
    Returns list of rejects from gopher.rejects.
Gopher.run(self,runpath=None)
    Main Gopher run, including forking.
Gopher.setMode(self)


Gopher.setupBlast(self)
    Sets up the self.obj['BLAST'] object.
Gopher.stiGG(self,gopherseq,gopherblast)
    Generate and Process StGG dictionary from orthology files.
Gopher.stiGGMe(self,seq=None,acc=None)
    Process StGG dictionary etc.
Gopher.stickySetup(self,seqfile,startfrom)
    Sets up special StiGG (Sticky GOPHER Groups) options.

GopherFork Class

    GopherFork Class. Author: Rich Edwards (2005).

    This class is designed to handle each forking of the main Gopher program. This will carry out the various stages
    associated with each sequence read by the main gopher method.

    Info:str
    - Name = Sequence ShortName
    - OrthDB = Fasta file with pool of sequences for orthology discovery [orthdb.fas]
    - SimFocus = Style of MinSim used [query]
        - query = %query must > minsim (Best if query is ultimate focus and maximises closeness of returned orthologues)
        - hit = %hit must > minsim (Best if lots of sequence fragments are in searchdb and should be retained)
        - either = %query > minsim OR %hit > minsim (Best if both above conditions are true)
        - both = %query > minsim AND %hit > minsim (Gets most similar sequences in terms of length)
    - GABLAMO Key = GABLAMO measure to use for similarity measures [Sim]
        - ID = %Identity (from BLAST)
        - Sim = %Similarity (from BLAST)
        - Len = %Coverage (from BLAST)
    - Mode = Gopher mode being run for fork [None]
    
    Opt:boolean
    - DropOut = Whether to "drop out" at earlier phases, or continue with single sequence [False]
    - Force = Whether to force execution even if results are new enough [False]
    - NoExec = Whether to simply report what *would* be executed [False]
    - PostDup = Whether to align only post-duplication sequences [True]
    - IgnoreDate = Ignores the age of files and only replaces if missing [False]
    - SaveSpace = Save space by deleting intermediate blast files during orthfas [True]
    - Paralign = Whether to produce paralogue alignments (>minsim) in PARALN/ [True]
    - ParaSplice = Whether to allow splice variants in paralogue alignments (where identified) [False]
    - ParaFam = Whether to produce paralogue paired subfamily alignments (>minsim) (assuming run to orthfas+) [False]
    - GopherFam = Whether to combine paralogous gopher orthologues into protein families (>minsim) (assuming run to orthfas+) [False]
    - Sticky = Switch on "Sticky Orthologous Group generation" [False]

    Stat:numeric
    - MinSim = Min %Sim for orth vs qry or qry vs orth (i.e. shorter versus longer)
    - MaxPara = Maximum number of paralogues to consider (large gene families can cause problems) [50]

    List:list

    Dict:dictionary    

    Obj:RJE_Objects
    - Sequence:rje_seq.Sequence = 'Query' Sequence
    - BLAST:rje_blast.BLASTRun = Blast Run. (If None, invoke 'gopher' mode)

    Run modes:
    orthblast   :   Run to blasting versus orthdb (Stage 1).
    orthfas     :   Run to output of orthologues (Stage 2). 
    orthalign   :   Run to alignment of orthologues (Stage 3). 
    orthtree    :   Run to generation of trees (Stage 4). 
    orthanc     :   Run to generation of AncSeqs (Stage 5, GASP).
GopherFork._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
GopherFork._gopherFam(self)
    Makes a combined protein family dataset (unaligned) consisting of a protein, all the paralogues > minsim and all
    orthologues for each in a single dataset.
GopherFork._needStage(self,stage,needfiles,prevfile)
    Generic method for checking a stage has been completed OK.
    >> stage:str = current stage
    >> needfiles:list = list of filenames produced by stage in order created
    >> prevfile:str = filename for last (i.e. stats?) output of previous stage
GopherFork._orthAln(self)
    Makes an aligned Fasta file of potential orthologues.
    5. Align sequences with MUSCLE
        > ALN/AccNum.orthaln.fas
GopherFork._orthBlast(self,compf=True)
    Blasts Query against OrthDB and populates BLAST object.
    1. BLAST against orthdb.
    2. Read BLAST and save sequence IDs as BLAST/AccNum.blast.id
    .. Output stats to *.gopher_blast
    << Returns True if Query OK and BLAST OK. Otherwise, returns False.
GopherFork._orthFas(self)
    Makes an unaligned Fasta file of potential orthologues.
        > Save reduced sequences as ORTH/AccNum.orth.fas
GopherFork._orthTree(self)
    Makes an unrooted (by default) tree using rje_tree.py.
GopherFork._parAlign(self,qseq=None,paraseq=None,gablamo_matrix=None)
    Makes an aligned Fasta file of close paralogues: PARALN/accnum.paraln.fas
    >> qseq:Sequence Object (Query)
    >> paraseq:list of Sequence Objects (Paralogues)
    >> gablamo_matrix:dictionary of GABLAMO distances (seq1:seq2)
GopherFork._paraFam(self)
    Makes a rooted tree of two paralogous gopher subfams using rje_tree.py.
GopherFork._phosAlign(self)
    Make potential phosphorylation site alignment tables (Extra).
GopherFork._runStage(self,stage)
    Generic method for checking a stage has been completed OK.
    >> stage:str = current stage
    << returns False if ending here due to failure etc. or True if stage completed and ready for next stage.
GopherFork._setAttributes(self)
    Sets Attributes of Object.
GopherFork.run(self,mode)
    Main GopherFork Run method. Calls _runStage(mode) as appropriate.

gopher_V2 Module Methods

gopher_V2.runMain()




gopher_V2 Module ToDo Wishlist

    # [ ] : Add a reciprocal best hit method (Using any stat including BLAST score) for comparison
    # -- [ ] : Add a method=X option for GOPHER vs Sticky vs PostDup vs MBH.
    # -- [ ] : Add option to read in distances from file.
    # [ ] : Consider re-introduction of GOPHER Statistics
    # [ ] : Consider addition of a menu front-end
    # [ ] : Add or scrap orthanc (GASP) run mode. (Replaced by gopherfam etc?)
    # [ ] : Consider automated BADASP analysis?
    # [ ] : Add automated GABLAM table generation for GOPHER alignments?

PRESTO [version 5.0] Protein Regular Expression Search Tool ~ (presto_V5.py)

Program: PRESTO
Description: Protein Regular Expression Search Tool
Version: 5.0
Last Edit: 22/01/07
Imports: rje, rje_aaprop, rje_blast, rje_disorder, rje_motif_V3, rje_motif_stats, rje_pam, rje_scoring, rje_seq, rje_sequence, rje_uniprot, rje_motiflist, gopher_V2
Imported By: None
Note: This program has been superseded in most functions by SLiMSearch.
Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

Function:
PRESTO is what the acronym suggests: a search tool for searching proteins with peptide sequences or motifs using an
algorithm based on Regular Expressions. The simple input and output formats and ease of use on local databases make
PRESTO a useful alternative to web resources for high throughput studies.

The additional benefits of PRESTO that make it more useful than a lot of existing tools include:
* PRESTO can be given alignment files from which to calculate conservation statistics for motif occurrences.
* searching with mismatches rather than restricting hits to perfect matches.
* additional statistics, including protein disorder, surface accessibility and hydrophobicity predictions
* production of separate fasta files containing the proteins hit by each motif.
* production of both UniProt format results and delimited text results for easy incorporation into other applications.
* inbuilt tandem Mass Spec ambiguities.

PRESTO recognises "n of m" motif elements in the form <X:n:m>, where X is one or more amino acids that must occur n+
times across m positions. E.g. <IL:3:5> must have 3+ Is and/or Ls in a 5aa stretch.
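
The "n of m" rule can be illustrated with a short stand-alone check: a stretch of m positions matches if at least n
of them are drawn from the allowed amino acids. This is a sketch of the rule only, not PRESTO's search code, and the
function name has_n_of_m is hypothetical.

```python
def has_n_of_m(sequence, residues, n, m):
    """True if any window of m positions contains at least n allowed residues."""
    allowed = set(residues)
    for start in range(max(len(sequence) - m + 1, 0)):
        window = sequence[start:start + m]
        if sum(aa in allowed for aa in window) >= n:
            return True
    return False


# 3+ Is and/or Ls in a 5aa stretch, as in the example above.
print(has_n_of_m("AAILALAA", "IL", 3, 5))
```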

Main output for PRESTO is a delimited file of motif/peptide occurrences but the motifaln=T and proteinaln=T also allow
output of alignments of motifs and their occurrences. PRESTO has an additional motinfo=FILE output, which produces a
summary table of the input motifs, including Expected values if searchdb given and information content if motifIC=T.
Hit proteins can also be output in fasta format (fasout=T) or UniProt format with occurrences as features (uniprot=T).

Release Notes:
Expectation scores have been modified since PRESTO Version 1.x. In addition to the expectation score for the number
of occurrences of a given motif (given the number of mismatches) in the entire dataset ("EXPECT"), there is now an
estimate of the probability of the observed number of occurrences, derived from a Poisson distribution, which is
output in the log file ("#PROB"). Furthermore, these values are now also calculated for each sequence individually
("SEQ_EXP" and "SEQ_PROB").
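
The notes above do not give the exact formula, so the following is only a sketch of one plausible reading: the
probability of seeing at least the observed number of occurrences when the count follows a Poisson distribution whose
mean is the expectation score. The function name poisson_prob_at_least is hypothetical, and treating the probability
as a cumulative "observed or more" value is an assumption.

```python
import math


def poisson_prob_at_least(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected).

    Assumption: the reported probability is cumulative ("observed or more").
    """
    if observed <= 0:
        return 1.0
    if expected <= 0:
        return 0.0
    # Sum P(X = k) for k < observed, then take the complement
    cumulative = 0.0
    for k in range(observed):
        cumulative += math.exp(-expected) * expected ** k / math.factorial(k)
    return max(0.0, 1.0 - cumulative)


# Example: 3 observed occurrences where 0.5 were expected
print(poisson_prob_at_least(3, 0.5))
```

The same function could be applied to the whole dataset or to a single sequence, given the appropriate expectation.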

Note on MS-MS mode: The old Perl version of Presto had a handy MS-MS mode for searching peptides sequenced from tandem
mass-spec data. (In this mode [msms=T], amino acids of equal mass (Leu-Ile [LI], Gln-Lys [QK], MetO-Phe [MF]) are
automatically placed as possible variants and additional output columns give information of predicted tryptic fragment
masses etc.) Implementation of MS-MS mode has been started in this version but discontinued due to lack of demand. As a
result, extra tryptic fragment data is not produced. If you would like to use it, contact me at richard.edwards@ucd.ie
and I will finish implementing it.

Note for compare=T mode: This is still fully functional but main documentation has been moved to comparimotif.py.

!!!NEW!!! for version 3.7, PRESTO has an additional domfilter=FILE option. This is quite crude and will read in domains
to be filtered from the FILE given. This file MUST be tab-delimited and must have at least three columns, with headers
'Name','Start' and 'Stop', where Name matches the short name of the Hit and 'Start' and 'End' are the positions of the
domain 1-N. This will output two additional columns, plus a further two if iupred=T:
* DOM_MASK = Gives the motif a score of the length of the domain if it would be masked out by masking domains or 0 if not
* DOM_PROP = Gives the proportion of motif positions in a domain
* DOM_DIS = Gives the motif the mean disorder score for the *domain* if in the domain, else 1.0 if not
* DOM_COMB = Gives positions in the domain the mean disorder score for the domain, else they keep their own scores
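
A minimal sketch of how a domain file of this kind might be read and used to calculate a DOM_PROP-style value is given
below. It is not the PRESTO code: the function names load_domains and dom_prop are hypothetical, the example accession
is made up, and the sketch accepts either 'Stop' or 'End' as the final header because both names appear above.

```python
import csv
import io


def load_domains(handle):
    """Read tab-delimited domain rows into {name: [(start, end), ...]} (positions 1-N)."""
    domains = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # Assumption: the end column may be headed 'Stop' or 'End'
        end = row.get("Stop") or row.get("End")
        domains.setdefault(row["Name"], []).append((int(row["Start"]), int(end)))
    return domains


def dom_prop(domains, hit, motif_start, motif_end):
    """Proportion of motif positions (1-N, inclusive) inside any domain of the hit."""
    positions = range(motif_start, motif_end + 1)
    covered = sum(
        1 for pos in positions
        if any(start <= pos <= end for start, end in domains.get(hit, []))
    )
    return covered / len(positions)


# Made-up example data: one domain at positions 10-50 of hit P12345
example = io.StringIO("Name\tStart\tStop\nP12345\t10\t50\n")
domains = load_domains(example)
print(dom_prop(domains, "P12345", 48, 52))
```

In this example three of the five motif positions (48 to 50) lie inside the domain.
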

!!!NEW!!! for version 4.0, PRESTO has a Peptide design mode (peptides=T), using winsize=X to set size of peptides around
occurrences. This will output peptide sequences into a fasta file and additional columns to the main PRESTO output file:
* PEP_SEQ = Sequence of peptide
* PEP_DESIGN = Peptide design comments. "OK" if all looking good, else warnings of bad AA combos (DP, DC, DG, NG, NS or PP)
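
The PEP_DESIGN check can be pictured with the short sketch below. Only the list of flagged amino acid pairs comes from
the text above; the function name peptide_design and the wording of the warning string are assumptions made for
illustration.

```python
# Amino acid pairs flagged in the PEP_DESIGN description
BAD_COMBOS = ("DP", "DC", "DG", "NG", "NS", "PP")


def peptide_design(peptide):
    """Return "OK" or a warning listing the flagged pairs found in the peptide.

    Assumption: the warning format is invented for this sketch.
    """
    peptide = peptide.upper()
    found = [combo for combo in BAD_COMBOS if combo in peptide]
    if not found:
        return "OK"
    return "Warning: " + ",".join(found)


print(peptide_design("ACDEFG"))
print(peptide_design("ADPPK"))
```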

Development Notes: (To be assimilated with release notes etc. when version is fully functional.)
Main output is now determined by outfile=X and/or basefile=X, which will set the self.info['Basefile'] attribute,
using standard rje module commands. If it is not set (i.e. is '' or 'None'), it will be generated using the motif and
searchdb files as with the old PRESTO. Main search output will use this file leader and add the appropriate extension based
on the output type and delimiter:
* resfile = Main PRESTO search = *.presto.tdt
* motifaln = Produce fasta files of local motif alignments *.motifaln.fas
* uniprot = Output of hits as a uniprot format file = *.uniprot.presto
* motinfo = Motif summary table = *.motinfo.tdt
* ftout = Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [*.features.tdt]
* peptides = Peptides designed around motifs = *.peptides.fas
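
The naming scheme listed above amounts to adding a fixed extension to the base file name. The sketch below shows that
mapping for tab-delimited output only; the dictionary and function names are hypothetical, and the handling of other
delimiters is not covered.

```python
# Extensions taken from the list above (tab-delimited output assumed)
OUTPUT_EXTENSIONS = {
    "resfile": "presto.tdt",
    "motifaln": "motifaln.fas",
    "uniprot": "uniprot.presto",
    "motinfo": "motinfo.tdt",
    "ftout": "features.tdt",
    "peptides": "peptides.fas",
}


def output_name(basefile, output_type):
    """Build an output file name from the base file name and output type."""
    return "%s.%s" % (basefile, OUTPUT_EXTENSIONS[output_type])


print(output_name("myrun", "resfile"))  # myrun.presto.tdt
```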

Other special output will generate their names using protein and/or motif names using the root PATH of basefile
(e.g. the PATH will be stripped and ProteinAln/ or HitFas/ directories made for output):
* * proteinaln=T/F : Search for alignments of proteins containing motifs and produce new file containing motifs [False]
* * fasout=T/F : Whether to output hit sequences as a fasta format file motif.fas [False]

Reformatting and outputting motifs requires a file name to be given:
* * motifout=FILE : Filename for output of reformatted (and filtered?) motifs in PRESTO format [None]



PRESTO Commands:
## Basic Input Parameters ##
* motifs=FILE : File of input motifs/peptides [None]
Single line per motif format = 'Name Sequence #Comments' (Comments are optional and ignored)
Alternative formats include fasta, SLiMDisc output and raw motif lists.
* minpep=X : Min length of motif/peptide X aa [2]
* minfix=X : Min number of fixed positions for a motif to contain [0]
* minic=X : Min information content for a motif (1 fixed position = 1.0) [2.0]
* trimx=T/F : Trims Xs from the ends of a motif [False]
* nrmotif=T/F : Whether to remove redundancy in input motifs [False]
* searchdb=FILE : Protein Fasta file to search (or second motif file to compare) [None]
* xpad=X : Adds X additional Xs to the flanks of the motif (after trimx if trimx=T) [0]
* xpaddb=X : Adds X additional Xs to the flanks of the search database sequences (will mess up alignments) [0]
* minimotif=T/F : Input file is in minimotif format and will be reformatted (PRESTO File format only) [False]
* goodmotif=LIST : List of text to match in Motif names to keep (can have wildcards) []
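
For the single-line motif format ('Name Sequence #Comments'), a parser along the following lines would be enough. This
is a sketch rather than the PRESTO reader: the function name parse_motif_line is hypothetical, the example motif names
are made up, and it assumes that '#' never appears inside a motif name or sequence.

```python
def parse_motif_line(line):
    """Split a 'Name Sequence #Comments' line into (name, sequence).

    Returns None for blank lines, comment-only lines and lines without
    both a name and a sequence.
    """
    # Assumption: everything after the first '#' is an optional comment
    text = line.split("#", 1)[0].strip()
    if not text:
        return None
    fields = text.split()
    if len(fields) < 2:
        return None
    return fields[0], fields[1]


print(parse_motif_line("MyMotif RGD # made-up example"))
```

Fasta, SLiMDisc and raw motif list inputs would need their own handling and are not covered here.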

## Basic Output Parameters ##
* outfile=X : Base name of results files, e.g. X.presto.tdt. [motifsFILE-searchdbFILE.presto.tdt]
* expect=T/F : Whether to give crude expect values based on AA frequencies [True]
* nohits=T/F : Save list of sequence IDs without motif hits to *.nohits.txt. [False]
* useres=T/F : Whether to append existing results to *.presto.txt and *.nohits.txt (continuing after last sequence)
and/or use existing results to search for conservation in alignments if usealn=T. [False]
* mysql=T/F : Output results in mySQL format - lower case headers and no spaces [False]
* hitname=X : Format for Hit Name: full/short/accnum [short]
* fasout=T/F : Whether to output hit sequences as a fasta format file motif.fas [False]
* datout=T/F : Whether to output hits as a uniprot format file *.uniprot.presto [False]
* motinfo=T/F : Whether to output motif summary table *.motinfo.tdt [None]
* motifout=FILE : Filename for output of reformatted (and filtered?) motifs in PRESTO format [None]

## Advanced Output Options ##
* winsa=X : Number of aa to extend Surface Accessibility calculation either side of motif [0]
* winhyd=X : Number of aa to extend Eisenberg Hydrophobicity calculation either side of motif [0]
* windis=X : Extend disorder statistic X aa either side of motif (use flanks *only* if negative) [0]
* winchg=X : Extend charge calculations (if any) to X aa either side of motif [0]
* winsize=X : Sets all of the above window sizes (use flanks *only* if negative) [0]
* slimchg=T/F : Calculate Absolute, Net and Balance charge statistics (above) for occurrences [False]
* iupred=T/F : Run IUPred disorder prediction [False]
* foldindex=T/F : Run FoldIndex disorder prediction [False]
* iucut=X : Cut-off for IUPred results (0.0 will report mean IUPred score) [0.0]
* iumethod=X : IUPred method to use (long/short) [short]
* iupath=PATH : The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
* domfilter=FILE : Use the DomFilter options, reading domains from FILE [None]
* runid=X : Adds an additional Run_ID column identifying the run (for multiple appended runs) [None]
* restrict=LIST : List of files containing instances (hit,start,end) to output (only) []
* exclude=LIST : List of files containing instances (hit,start,end) to exclude []
* peptides=T/F : Peptide design mode, using winsize=X to set size of peptides around motif [False]
* newscore=LIST : Lists of X:Y, create a new statistic X, where Y is the formula of the score. []

## Basic Search Parameters ##
* mismatch=X,Y : Peptide must be >= Y aa for X mismatches
* ambcut=X : Cut-off for max number of choices in ambiguous position to be shown as variant [10]
* expcut=X : The maximum number of expected occurrences allowed to still search with motif [0] (if -ve, per seq)
* alphabet=X,Y,.. : List of letters in alphabet of interest [AAs]
* reverse=T/F : Reverse the motifs - good for generating a test comparison data set [False]
*** No longer outputs *.rev.txt - use motifout=X instead! ***

* msms=T/F : Whether searching Tandem Mass Spec peptides [False]
* ranking=T/F : Whether to rank hits by their rating in MSMS mode [False]
* memsaver=T/F : Whether to store all results in Objects (False) or clear as search proceeds (True) [True]
* startfrom=X : Accession Number / ID to start from. (Enables restart after crash.) [None]
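
The mismatch=X,Y rule ("Peptide must be >= Y aa for X mismatches") can be read as a lookup from motif length to the
number of mismatches allowed. The sketch below takes that reading. The function name allowed_mismatches, the example
thresholds and the use of a plain {X: Y} dictionary are assumptions, and how PRESTO measures motif length is not
specified here.

```python
def allowed_mismatches(motif_length, mismatch):
    """Return the largest X in {X: Y} whose minimum length Y is met, else 0."""
    allowed = [x for x, y in mismatch.items() if motif_length >= y]
    return max(allowed) if allowed else 0


# Hypothetical settings: 1 mismatch from 6 aa, 2 mismatches from 10 aa
mismatch = {1: 6, 2: 10}
for length in (5, 8, 12):
    print(length, allowed_mismatches(length, mismatch))
```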

## Conservation Parameters ##
* usealn=T/F : Whether to search for and use alignments where present. [False]
* gopher=T/F : Use GOPHER to generate missing orthologue alignments in alndir - see gopher.py options [False]
* fullforce=T/F : Force GOPHER to re-run even if alignment exists [False]
* alndir=PATH : Path to alignments of proteins containing motifs [./] * Use forward slashes (/)
* alnext=X : File extension of alignment files, accnum.X [aln.fas]
* alngap=T/F : Whether to count proteins in alignments that have 100% gaps over motif (True) or (False) ignore
as putative sequence fragments [False] (NB. All X regions are ignored as sequence errors.)
* conspec=LIST : List of species codes for conservation analysis. Can be name of file containing list. [None]
* conscore=X : Type of conservation score used: [pos]
- abs = absolute conservation of motif using RegExp over matched region
- pos = positional conservation: each position treated independently
- prop = conservation of amino acid properties
- all = all three methods for comparison purposes
* consamb=T/F : Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
* consinfo=T/F : Weight positions by information content (does nothing for conscore=abs) [True]
* consweight=X : Weight given to global percentage identity for conservation, giving more weight to closer sequences [0]
- 0 gives equal weighting to all. Negative values will upweight distant sequences.
* posmatrix=FILE : Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) [None]
* aaprop=FILE : Amino Acid property matrix file. [aaprop.txt]
* consout=T/F : Outputs an additional result field containing information on the conservation score used [False]

## Additional Output for Extracted Motifs ##
* motific=T/F : Output Information Content for motifs [False]
* motifaln=T/F : Produce fasta files of local motif alignments [False]
* proteinaln=T/F : Search for alignments of proteins containing motifs and produce new file containing motifs [False]
* protalndir=PATH : Directory name for output of protein alignments [ProteinAln/]
* flanksize=X : Size of sequence flanks for motifs [30]
* xdivide=X : Size of dividing Xs between motifs [10]
* ftout=T/F : Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [*.features.tdt]
* unipaths=LIST : List of additional paths containing uniprot.index files from which to look for and extract features ['']
* statfilter=LIST : List of stats to filter (*discard* occurrences) on, consisting of X*Y where:
- X is an output stat (the column header),
- * is an operator in the list >, >=, !=, =, <=, < !!! Remember to enclose in "quotes" for <> !!!
- Y is a value that X must have, assessed using *.
This filtering is crude and may behave strangely if X is not a numerical stat!
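
To make the X*Y filter syntax concrete, the sketch below parses a filter string into its stat, operator and value, and
then tests whether an occurrence would be discarded. It is an illustration only: the names parse_stat_filter and
is_discarded are hypothetical, the stat name "Cons" is a made-up column header, and all values are treated as numbers.

```python
import operator
import re

# Two-character operators are listed first so that ">=" is not read as ">"
OPERATORS = {
    ">=": operator.ge, "<=": operator.le, "!=": operator.ne,
    ">": operator.gt, "<": operator.lt, "=": operator.eq,
}


def parse_stat_filter(text):
    """Split 'X*Y' into (stat, comparison function, numeric value)."""
    match = re.match(r"^(.+?)(>=|<=|!=|>|<|=)(.+)$", text)
    if not match:
        raise ValueError("Cannot parse stat filter: %r" % text)
    stat, op, value = match.groups()
    return stat.strip(), OPERATORS[op], float(value)


def is_discarded(occurrence, filters):
    """True if the occurrence meets any filter (filters *discard* occurrences)."""
    return any(op(float(occurrence[stat]), value) for stat, op, value in filters)


filters = [parse_stat_filter("Cons<0.5")]
print(is_discarded({"Cons": "0.3"}, filters))
print(is_discarded({"Cons": "0.7"}, filters))
```

Non-numeric stats would raise an error in this sketch, which is one way the crudeness noted above could show up.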

## Motif Comparison Parameters ##
* compare=T/F : Compare the motifs from the motifs FILE with the searchdb FILE (or self if None) [False]
* minshare=X : Min. number of non-wildcard positions for motifs to share [2]
* matchfix=X : If >0 must exactly match *all* fixed positions in the motifs from: [0]
- 1: input (motifs=FILE) motifs
- 2: searchdb motifs
- 3: *both* input and searchdb motifs
* matchic=T/F : Use (and output) information content of matched regions to assess motif matches [True]
* motdesc=X : Sets which motifs have description outputs (0-3 as matchfix option) [3]
* outstyle=X : Sets the output style for the resfile [normal]
- normal = all standard stats are output
- multi = designed for multiple appended runs. File names are also output
- single = designed for searches of a single motif vs a database. Only motif2 stats are output
- normalsplit/multisplit = as normal/multi but stats are grouped by motif rather than by type

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje, rje_aaprop, rje_disorder, rje_motif_V3, rje_motif_cons, rje_scoring, rje_seq, rje_sequence,
rje_blast, rje_pam, rje_uniprot
Other modules needed: rje_dismatrix

Presto Class

    PRESTO Class. Author: Rich Edwards (2005).

    Handles main thread for PRESTO motif searches.    

    Info:str
    - Name = Name of input motif/peptide file
    - SearchDB = Name of protein fasta file to search for motifs
    - AlnDir = Path to alignment files
    - AlnExt = File extensions of alignments: AccNum.X
    - ResFile = Base for results files *.presto.txt and *.nohits.txt
    - StartFrom = Accession Number / ID to start from. (Enables restart after crash.) 
    - UniPaths = List of paths containing uniprot.index files from which to look for and extract features ['']
    - ConScore = Type of conservation score used:  [abs]
        - abs = absolute conservation of motif: reports percentage of homologues in which conserved
        - prop = conservation of amino acid properties
        - pam = calculates a weighted conservation based on PAM distance between sequences
    - PosMatrix = Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix)
    - OutStyle = Sets the output style for the resfile [normal]
        - normal = all standard stats are output
        - multi = designed for multiple appended runs. File names are also output
        - single = designed for searches of a single motif vs a database. Only motif2 stats are output
        - normalsplit/multisplit = as normal/multi but stats are grouped by motif rather than by type
    - ProtAlnDir = Directory name for output of protein alignments [ProteinAln/]
    - HitName = Format for Hit Name: full/short/accnum [short]
    - DomFilter = Use the DomFilter options, reading domains from FILE [None]
    - RunID = Adds an additional Run_ID column identifying the run (for multiple appended runs) [None]
    - MotifOut = Filename for output of reformatted (and filtered?) motifs in PRESTO format [None]
    
    Opt:boolean
    - MotInfo = Output motif summary table [False]
    - NRMotif = Whether to remove redundancy in input motifs [False]
    - Expect = Whether to calculate crude 'expected' values based on AA composition.
    - MSMS = Whether to run in MSMS mode
    - Ranking = Whether to rank hits in MSMS mode. [Currently not implemented.]
    - UseAln = Whether to look for conservation in alignments
    - UseRes = Whether to use existing results files
    - NoHits = Whether to save list of sequences that lack hits
    - FasOut = whether to output hit sequences to fasta file
    - DatOut = whether to output hits to uniprot format file
    - Reverse = Reverse the motifs - good for generating a test comparison data set [False]
    - FTOut = Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [True]
    - MotifAln = Produce fasta files of local motif alignments [True]
    - ProteinAln = Search for alignments of proteins containing motifs and produce new file containing motifs [False]
    - Compare = Compare the motifs from the motifs FILE with the searchdb FILE (or self if None) [False]
    - MatchIC = Use (and output) information content of matched regions to assess motif matches [True]
    - MotifIC = Output Information Content for motifs [False]
    - IUPred = Run IUPred disorder prediction [False]
    - FoldIndex = Run FoldIndex disorder prediction [False]
    - ConsInfo = Weight positions by information content [True]
    - AlnGap = Whether to count proteins in alignments that have 100% gaps over motif (True) or (False) ignore as putative sequence fragments [True]
    - Gopher = Use GOPHER to generate missing orthologue alignments in outdir/Gopher - see gopher.py options [False]
    - FullForce = Force GOPHER to re-run even if alignment exists [False]
    - ConsAmb = Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
    - ConsOut = Outputs an additional result field containing information on the conservation score used [False]
    - TrimX = Trims Xs from the ends of a motif
    - Searched = Whether PRESTO has been used to search a database with a list of motifs
    - SlimChg = Calculate Absolute, Net and Balance charge statistics (above) for occurrences [False]
    - Peptides = Peptide design mode, using winsize=X to set size of peptides around motif [False]
    - MiniMotif = Input file is in minimotif format and will be reformatted [False]
    - Search = Whether to search database with motifs [True]
    
    Stat:numeric
    - AmbCut = Cut-off for max number of choices in ambiguous position to be shown as variant [10]
        For mismatches, this is the max number of choices for an ambiguity to be replaced with a mismatch wildcard
    - ExpCut = The maximum number of expected occurrences allowed to still search with motif [0]
    - MinPep = Minimum length of motif/peptide (non-X characters)
    - MinFix = Min number of fixed positions for a motif to contain [0]
    - MinIC = Min information content for a motif (1 fixed position = 1.0) [2.0]
    - FlankSize = Size of sequence flanks for motifs in MotifAln [30]
    - XDivide = Size of dividing Xs between motifs [10]
    - MinShare = Min. number of non-wildcards for motifs to share in Compare [2]
    - MatchFix = If >0 must match *all* fixed positions in the motifs from:  [0]
        - 1: input (motifs=FILE) motifs
        - 2: searchdb motifs
        - 3: *both* input and searchdb motifs
    - ConsWeight = Weight given to global percentage identity for conservation, giving more weight to closer sequences [0]
    - MotDesc = Sets which motifs have description outputs (0-3 as matchfix option) [3]
    - XPad = Adds X additional Xs to the flanks of the motif (after trimx if trimx=T) [0]
    - XPadDB = Adds X additional Xs to the flanks of the search database sequences [0]

    List:list
    - Alphabet = List of letters in alphabet of interest
    - Headers = Column headers for data output. Stored for ease of results output   #!# Eliminate when dict functioning #!#
    - Motifs = List of rje_motif_V3.Motif objects
    - SeqHits = List of PrestoSeqHit objects (if memsaver=F)
    ## Advanced Filtering/Ranking Options ##
    - StatFilter = List of stats to filter on, consisting of X*Y where:
          - X is an output stat (the column header),
          - * is an operator in the list >, >=, =, <= ,<, !=
          - Y is a value that X must have, assessed using *.
          This filtering is crude and may behave strangely if X is not a numerical stat!
    - Restrict = List of files containing instances (hit,start,end) to output (only) []
    - Exclude = List of files containing instances (hit,start,end) to exclude []
    - GoodMotif = List of text to match in Motif names to keep (can have wildcards) []
    - NewScore = self.dict['NewScore'] keys() in order they were read in

    Dict:dictionary
    - Expect = Dictionary of {Motif:{mm:exp}}
    - MisMatch = Dictionary of mismatches X:Y
    - ConsSpecLists = Dictionary of {BaseName:List} lists of species codes for special conservation analyses
    - OccCount = Dictionary of {Motif:Number of occurrences in dataset}
    - MotifOcc = Dictionary of {Sequence:{Motif:{Pos:Match}}}   (Pos is from 0 to L-1)
    - MotOccVar = Dictionary of {Sequence:{Motif:{Pos:Variant}}}   (Pos is from 0 to L-1)
    - ElementIC = Dictionary of {Position Element:IC}
    - PosMatrix = Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) {}
    - StatFilter = Dictionary of stat filters made from self.list['StatFilter']
    - DomFilter = Dictionary of {HitName:list of domains [(start,end)] arranged in length order}
    - Restrict = Dictionary of instances {motif:[(hit,start,end)]} to output (only) []
    - Exclude = Dictionary of instances {motif:[(hit,start,end)]} to exclude []
    - Headers = Dictionary of {Outputfiletype:[Headers]}    # See module output dictionary for types and extensions #
    - NewScore = dictionary of {X:Y} for new statistic X, where Y is the formula of the score. []

    Obj:RJE_Objects
    - AAPropMatrix = rje_aaprop.AAPropMatrix object
    - PAM = rje_pam.PamCtrl object for PAM conservation
Presto._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Presto._cmdMismatch(self)
    Sets up the search Mismatchdic from cmd_list.
Presto._setAttributes(self)
    Sets Attributes of Object
Presto.motInfoHeaders(self)
    Sets up Motif Info Table headers.
Presto.motifNum(self)


Presto.motifs(self)


Presto.prestoHeaders(self)
    Sets up main results file headers from Options.
Presto.processSeqOccs(self,occlist)
    Processes Occurrences after search - output to results file(s).
    >> occlist:list of MotifOcc objects - must have same obj['Seq']
    << returns filtered occlist
Presto.run(self)
    General Run method for Presto:
    0. Check relevant files
    1. Load motifs, Reformat and check for redundancy, Make (mismatch) variants
    2. Search Database and output results (or output MotInfo table)
Presto.runMenu(self)


Presto.searchDB(self)
    Main method for searching database with motifs.
Presto.searchSeq(self,seq,logtext='')
    Searches a single sequence with motifs. Based on old newSeqHit method and PrestoSeqHit object. First, a
    list of MotifOcc objects is built, then this is given to rje_motif_stats.
    >> seq:Sequence object to search
    >> logtext:str = Leading text of progress log output
Presto.setupRestrictExclude(self)
    Sets up Dictionary of Annotated Motif occurrences {Motif:[Hit,Start,Stop]}. These files must have a single
    header row and four columns containing: Motif, Hit, Start, End. Positions are given 1-N.
Presto.setupResults(self)
    Sets up headers for relative output files into self.dict['Headers'].
Presto.setupRun(self)
    Sets up input and output files etc.
Presto.uniProtHits(self)
    Processes PrestoSeqHit after search - output to uniprot-format results file.
    #!# This has not been altered since Presto Version 1.8. #!#

presto_V5 Module Methods

presto_V5.runMain()




QSLiMFinder [version 1.2] Query Short Linear Motif Finder ~ [Top]

Program: QSLiMFinder
Description: Query Short Linear Motif Finder
Version: 1.2
Last Edit: 15/09/10
Imports: rje, rje_blast, rje_seq, rje_sequence, rje_scoring, rje_xgmml, slimfinder, rje_slim, rje_slimcalc, rje_slimcore, rje_slimlist, rje_motif_V3, rje_dismatrix_V2, comparimotif_V3
Imported By: pingu
Citation: Edwards, Davey & Shields (2007), PLoS ONE 2(10): e967.
Copyright © 2008 Richard J. Edwards - See source code for GNU License Notice

Function:
QSLiMFinder is a modification of the basic SLiMFinder tool to specifically look for SLiMs shared by a query sequence
and one or more additional sequences. To do this, SLiMBuild first identifies all motifs that are present in the query
sequence before removing it (and its UPC) from the dataset. The rest of the search and stats takes place using the
remainder of the dataset but only using motifs found in the query. The final correction for multiple testing is made
using a motif space defined by the original query sequence, rather than the full potential motif space used by the
original SLiMFinder. This is offset against the increased probability of the observed motif support values due to the
reduction of support that results from removing the query sequence, but could potentially still identify SLiMs with
increased significance.

Note that minocc and ambocc values *include* the query sequence, e.g. minocc=2 specifies the query and ONE other UPC.

Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
* seqin=FILE : Sequence file to search [None]
* batch=LIST : List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
* query=LIST : Return only SLiMs that occur in 1+ Query sequences (Name/AccNum/Seq Number) [1]
* addquery=FILE : Adds query sequence(s) to batch jobs from FILE [None]
* maxseq=X : Maximum number of sequences to process [500]
* maxupc=X : Maximum UPC size of dataset to process [0]
* sizesort=X : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
* walltime=X : Time in hours before program will abort search and exit [1.0]
* resfile=FILE : Main SLiMFinder results table [qslimfinder.csv]
* resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [QSLiMFinder/]
* buildpath=PATH : Alternative path to look for existing intermediate files [SLiMFinder/]
* force=T/F : Force re-running of BLAST, UPC generation and SLiMBuild [False]
* pickup=T/F : Pick-up from aborted batch run by identifying datasets in resfile using RunID [False]
* dna=T/F : Whether the sequences files are DNA rather than protein [False]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SLiMBuild Options I: Evolutionary Filtering ###
* efilter=T/F : Whether to use evolutionary filter [True]
* blastf=T/F : Use BLAST Complexity filter when determining relationships [True]
* blaste=X : BLAST e-value threshold for determining relationships [1e-4]
* altdis=FILE : Alternative all by all distance matrix for relationships [None]
* gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
* homcut=X : Max number of homologues to allow (to reduce large multi-domain families) [0]

### SLiMBuild Options II: Input Masking ###
* masking=T/F : Master control switch to turn off all masking if False [True]
* dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [False]
* consmask=T/F : Whether to use relative conservation masking [False]
* ftmask=LIST : UniProt features to mask out [EM]
* imask=LIST : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
* compmask=X,Y : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
* casemask=X : Mask Upper or Lower case [None]
* motifmask=X : List (or file) of motifs to mask from input sequences []
* metmask=T/F : Masks the N-terminal M (can be useful if termini=T) [True]
* posmask=LIST : Masks list of position-specific aas, where list = pos1:aas,pos2:aas [2:A]
* aamask=LIST : Masks list of AAs from all sequences (reduces alphabet) []
* qregion=X,Y : Mask all but the region of the query from (and including) residue X to residue Y [0,-1]
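
The compmask=X,Y rule (same AA in X+ of Y consecutive aas) could be sketched as a sliding-window check, as below. This
is not the SLiMBuild masking code. The function name comp_mask is hypothetical, and the choice to mask only the
repeated residue within each qualifying window, using 'X' as the mask character, is an assumption.

```python
def comp_mask(sequence, x=5, y=8, mask_char="X"):
    """Mask residues that occur x+ times within any window of y residues."""
    residues = list(sequence)
    to_mask = set()
    for start in range(len(sequence) - y + 1):
        # Windows are taken from the unmasked sequence so masking does not cascade
        window = sequence[start:start + y]
        for aa in set(window):
            if window.count(aa) >= x:
                # Assumption: only the repeated residue is masked, not the whole window
                to_mask.update(start + i for i, res in enumerate(window) if res == aa)
    for pos in to_mask:
        residues[pos] = mask_char
    return "".join(residues)


print(comp_mask("AQQQQQAK"))  # AXXXXXAK
```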

### SLiMBuild Options III: Basic Motif Construction ###
* termini=T/F : Whether to add termini characters (^ & $) to search sequences [True]
* minwild=X : Minimum number of consecutive wildcard positions to allow [0]
* maxwild=X : Maximum number of consecutive wildcard positions to allow [2]
* slimlen=X : Maximum length of SLiMs to return (no. non-wildcard positions) [5]
* minocc=X : Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if < 1) [0.05]
* absmin=X : Used if minocc<1 to define absolute min. UP occ [3]
* alphahelix=T/F : Special i, i+3/4, i+7 motif discovery [False]
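
The interaction between minocc and absmin can be pictured with the sketch below, which converts a proportional minocc
into an absolute number of unrelated occurrences. The function name effective_minocc is hypothetical, and rounding the
proportion up to a whole number is an assumption rather than documented behaviour.

```python
import math


def effective_minocc(minocc, upc_count, absmin=3):
    """Return the absolute minimum number of unrelated occurrences required."""
    if minocc >= 1:
        # Values of 1 or more are already absolute counts
        return int(minocc)
    # Assumption: the proportion of UPC is rounded up, then floored at absmin
    return max(absmin, int(math.ceil(minocc * upc_count)))


print(effective_minocc(0.5, 10))   # 5
print(effective_minocc(0.05, 20))  # 3 (absmin applies)
print(effective_minocc(2, 50))     # 2
```

For QSLiMFinder, the resulting count includes the query sequence.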

### SLiMBuild Options IV: Ambiguity ###
* preamb=T/F : Whether to search for ambiguous motifs during motif discovery [True]
* ambocc=X : Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or > minocc) [0.05]
* absminamb=X : Used if ambocc<1 to define absolute min. UP occ [2]
* equiv=LIST : List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
* wildvar=T/F : Whether to allow variable length wildcards [True]
* combamb=T/F : Whether to search for combined amino acid degeneracy and variable wildcards [False]

### SLiMBuild Options V: Advanced Motif Filtering ###
* musthave=LIST : Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
* focus=FILE : FILE containing focal groups for SLiM return (see Manual for details) [None]
* focusocc=X : Motif must appear in X+ focus groups (0 = all) [0]
* See also rje_slimcalc options for occurrence-based calculations and filtering *

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SLiMChance Options ###
* slimchance=T/F : Execute main SLiMFinder probability method and outputs [True]
* probcut=X : Probability cut-off for returned motifs [0.1]
* maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [False]
* aafreq=FILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
* aadimerfreq=FILE: Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [None]
* negatives=FILE : Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [None]
* smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
* seqocc=T/F : Whether to upweight for multiple occurrences in same sequence (heuristic) [False]
* probscore=X : Score to be used for probability cut-off and ranking (Prob/Sig) [Sig]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Advanced Output Options I: Output data ###
* clouds=X : Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc / -1=off) [2]
* runid=X : Run ID for resfile (allows multiple runs on same data) [DATE:TIME]
* logmask=T/F : Whether to log the masking of individual sequences [True]
* slimcheck=FILE : Motif file/list to add to resfile output []

### Advanced Output Options II: Output formats ###
* teiresias=T/F : Replace TEIRESIAS, making *.out and *.mask.fasta files [False]
* slimdisc=T/F : Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [False]
* extras=X : Whether to generate additional output files (alignments etc.) [1]
- 0 = No output beyond main results file
- 1 = Generate occurrence file, alignments and cloud file
- 2 = Generate all additional SLiMFinder outputs
- 3 = Generate SLiMDisc emulation too (equiv extras=2 slimdisc=T)
* targz=T/F : Whether to tar and zip dataset result files (UNIX only) [False]
* savespace=X : Delete "unnecessary" files following run (best used with targz): [0]
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle files
- 2 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)

### Advanced Output Options III: Additional Motif Filtering ###
* topranks=X : Will only output top X motifs meeting probcut [1000]
* minic=X : Minimum information content for returned motifs [2.1]
* See also rje_slimcalc options for occurrence-based calculations and filtering *

Uses general modules: copy, glob, math, os, string, sys, time
Uses RJE modules: slimfinder, rje, rje_blast, rje_slim, rje_slimlist, rje_slimcalc, rje_slimcore, rje_dismatrix_V2,
rje_seq, rje_scoring
Other modules needed: None

qslimfinder Module Version History

    # 0.0 - Initial Compilation based on SLiMFinder 3.5.
    # 1.0 - Test & Modified to include AA masking.
    # 1.1 - Added sizesort.
    # 1.2 - Added the addquery function.

QSLiMFinder Class

    QSLiMFinder Class. Author: Rich Edwards (2008).

    See SLiMFinder Class for details of Attributes.    
QSLiMFinder.QUP(self,upc)


QSLiMFinder.QUPNum(self)


QSLiMFinder._setAttributes(self)


QSLiMFinder.aaDp1(self,slim,upc)
    Sets up parameters for p1+ binomial.
QSLiMFinder.addQuery(self)
    Loads and sets addQuery sequence(s).
QSLiMFinder.backupOrCreateResFile(self)
    Backs up and/or creates main results file.
QSLiMFinder.extendSLiM(self,slim)
    Finds and returns extensions of SLiM with sufficient support.
    >> slim:str = SLiM to extend (using dimers)
QSLiMFinder.focusAdjustment(self,slim)
    Adjust raw probabilities according to focus dictionary.
    >> slim:str = SLiM for probability adjustment
QSLiMFinder.getSigSlim(self,pattern)
    Returns slimcode for given pattern (should be in SigSlim).
QSLiMFinder.getSlimProb(self,slim)
    Returns appropriate SLiM Score given settings.
QSLiMFinder.makeClouds(self)
    Identifies "clouds" of motifs - similar patterns that overlap (from SigSlim).
QSLiMFinder.makeDimers(self)
    Finds all possible dimers with wildcards, using MinWild/MaxWild stat.
QSLiMFinder.makeSLiMs(self)
    Makes SLiMs with enough support from Dimers.
QSLiMFinder.myPickle(self)
    Returns pickle identifier, also used for Outputs "Build" column.
QSLiMFinder.newBatchRun(self,infile)
    Returns QSLiMFinder object for new batch run.
QSLiMFinder.pickleMe(self,load=False)
    Saves pickle for later!
QSLiMFinder.rankScore(self)
    Scores and Ranks Sig Motifs.
QSLiMFinder.reduceDimers(self)
    Reduces Dimers to those with enough Support.
QSLiMFinder.reportQueryUPC(self)
    Reports input with UPC similarity to Query - also of interest.
QSLiMFinder.resHead(self)
    Returns main Output headers.
QSLiMFinder.run(self,batch=False)
    Main SLiMFinder Run Method:
    0. PreCheck:
        - Check for randomise function and execute if appropriate
    1. Input:
        - Read sequences into SeqList
        - or - Identify appropriate Batch datasets and rerun each with batch=True
    2. SLiMBuild:
        - Check for existing Pickle and load if found. Check appropriate parameter settings and re-run if desired.
        - or - Save sequences as fasta and mask sequences in SeqList
        - Perform BLAST and generate UPC based on saved fasta.
        - Calculate AAFreq for each sequence and UPC.
        - Find all dimer motifs in dataset using MinWild/MaxWild parameters.
        - Extend to SLiMs and add ambiguity
    5. Identify significant SLiMs.
    6. Output results and tidy files.
    >> batch:bool [False] = whether this run is already an individual batch mode run.
QSLiMFinder.setupQueryFocus(self)
    Sets up Query based on Focus dictionary - needed for QSLiMFinder run.
    Returns True if OK, else False (which cancels run).
QSLiMFinder.slimCheck(self)
    Checks given list of Motifs and adds to output.
QSLiMFinder.slimFocus(self,slim)
    Returns True if slim is in Focal sequence groups, else False.
QSLiMFinder.slimProb(self,slim)
    Calculate Probabilities for given SLiM.
QSLiMFinder.slimQUP(self,slim)




qslimfinder Module Methods

qslimfinder.patternFromCode(slim)


qslimfinder.runMain()


qslimfinder.slimDif(slim1,slim2)


qslimfinder.slimLen(slim)


qslimfinder.slimPos(slim)




qslimfinder Module ToDo Wishlist

    # [ ] : Finish and test.

SLiMDisc [version 1.4] Short, Linear Motif Discovery ~ (slimdisc_V1.4.py)[Top]

This module was not recognised by rje_pydocs.

SLiMFinder [version 4.2] Short Linear Motif Finder ~ [Top]

Program: SLiMFinder
Description: Short Linear Motif Finder
Version: 4.2
Last Edit: 04/05/11
Imports: rje, rje_blast, rje_seq, rje_sequence, rje_scoring, rje_xgmml, rje_slim, rje_slimcalc, rje_slimcore, rje_slimlist, rje_motif_V3, rje_dismatrix_V2, comparimotif_V3, ned_rankbydistribution
Imported By: pingu, qslimfinder, rje_slimhtml
Citation: Edwards, Davey & Shields (2007), PLoS ONE 2(10): e967.
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
Short linear motifs (SLiMs) in proteins are functional microdomains of fundamental importance in many biological
systems. SLiMs typically consist of a 3 to 10 amino acid stretch of the primary protein sequence, of which as few
as two sites may be important for activity, making identification of novel SLiMs extremely difficult. In particular,
it can be very difficult to distinguish a randomly recurring "motif" from a truly over-represented one. Incorporating
ambiguous amino acid positions and/or variable-length wildcard spacers between defined residues further complicates
the matter.

SLiMFinder is an integrated SLiM discovery program building on the principles of the SLiMDisc software for accounting
for evolutionary relationships [Davey NE, Shields DC & Edwards RJ (2006): Nucleic Acids Res. 34(12):3546-54].
SLiMFinder comprises two algorithms:

SLiMBuild identifies convergently evolved, short motifs in a dataset. Motifs with fixed amino acid positions are
identified and then combined to incorporate amino acid ambiguity and variable-length wildcard spacers. Unlike
programs such as TEIRESIAS, which return all shared patterns, SLiMBuild accelerates the process and reduces returned
motifs by explicitly screening out motifs that do not occur in enough unrelated proteins. For this, SLiMBuild uses
the "Unrelated Proteins" (UP) algorithm of SLiMDisc in which BLAST is used to identify pairwise relationships.
Proteins are then clustered according to these relationships into "Unrelated Protein Clusters" (UPCs), which are
defined such that no protein in a UPC has a BLAST-detectable relationship with a protein in another UPC. If desired,
SLiMBuild can be used as a replacement for TEIRESIAS in other software (teiresias=T slimchance=F).
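
The UPC definition above amounts to single-linkage clustering: any two proteins joined by a chain of BLAST-detectable
relationships end up in the same cluster. The following Python 3 sketch illustrates that idea only. The function name
cluster_upc and the list of related pairs are assumptions made for the example; it is not the SLiMFinder code and it
does not run BLAST.

```python
def cluster_upc(proteins, related_pairs):
    """Group proteins so that no related pair is split between two clusters."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative (with path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge the clusters of every pair with a detectable relationship.
    for a, b in related_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical input: P1-P2 and P2-P3 are related, so P1 and P3 share a UPC.
upc = cluster_upc(["P1", "P2", "P3", "P4", "P5"], [("P1", "P2"), ("P2", "P3")])
print(upc)
```

Here P1 and P3 are clustered together even though they have no direct relationship, because both are related to P2.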

SLiMChance estimates the probability of these motifs arising by chance, correcting for the size and composition of
the dataset, and assigns a significance value to each motif. Motif occurrence probabilities are calculated
independently for each UPC, adjusting the size of each UPC using the Minimum Spanning Tree algorithm from SLiMDisc. These
individual occurrence probabilities are then converted into the total probability of seeing the observed motifs
the observed number of (unrelated) times. These probabilities assume that the motif is known before the search. In
reality, only over-represented motifs from the dataset are looked at, so these probabilities are adjusted for the
size of motif-space searched to give a significance value. This is an estimate of the probability of seeing that
motif, or another one like it. These values are calculated separately for each length of motif. Where pre-known
motifs are also of interest, these can be given with the slimcheck=MOTIFS option and will be added to the output.
SLiMFinder version 4.0 introduced a more precise (but more computationally intensive) statistical model, which can
be switched on using sigprime=T. Likewise, the more precise (but more computationally intensive) correction to the
mean UPC probability heuristic can be switched on using sigv=T. (Note that the other SLiMChance options may not
work with either of these options.) The allsig=T option will output all four scores. In this case, SigPrimeV will be
used for ranking etc. unless probscore=X is used.
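
The general shape of this calculation can be sketched in Python 3 as below. This is a simplified illustration, assuming
a per-site probability, a mean probability across UPCs, a binomial tail and a Bonferroni-style correction for the
motif space; all function names and numbers are hypothetical. The exact SLiMChance formulae (including SigPrime and
SigV) are described in the citation and the SLiMFinder Manual and are not reproduced here.

```python
from math import comb


def p_one_plus(p_site, n_sites):
    """Probability of 1+ occurrences in a UPC with n_sites possible positions."""
    return 1.0 - (1.0 - p_site) ** n_sites


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


def significance(prob, motif_space):
    """Adjust a raw probability for the number of motifs that could have been seen."""
    return 1.0 - (1.0 - prob) ** motif_space


# Hypothetical dataset: six UPCs with different (MST-adjusted) numbers of sites.
site_counts = [200, 350, 500, 150, 400, 300]
upc_probs = [p_one_plus(0.0005, n) for n in site_counts]
mean_prob = sum(upc_probs) / len(upc_probs)

raw = binomial_tail(4, len(upc_probs), mean_prob)  # motif seen in 4 of 6 UPCs
sig = significance(raw, 2000)                      # 2000 motifs of this length searched
print(raw, sig)
```

The corrected value is never smaller than the raw probability, which is why a motif that looks rare in isolation can
still fail the probcut threshold once the size of the search is taken into account.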

Where significant motifs are returned, SLiMFinder will group them into Motif "Clouds", which consist of physically
overlapping motifs (2+ non-wildcard positions are the same in the same sequence). This provides an easy indication
of which motifs may actually be variants of a larger SLiM and should therefore be considered together.
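
As a rough Python 3 illustration of the overlap rule, the sketch below links two motifs when they share 2+ defined
positions in the same sequence and then groups linked motifs together. The input layout (motif mapped to a set of
(sequence, position) pairs for its non-wildcard residues) and the function names are assumptions for the example, and
the clouds=X threshold on the number of sequences is ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations


def overlaps(pos_a, pos_b, min_shared=2):
    """True if two motifs share min_shared+ defined positions in any one sequence."""
    shared = defaultdict(int)
    for seq, _pos in pos_a & pos_b:
        shared[seq] += 1
    return any(n >= min_shared for n in shared.values())


def make_clouds(motif_positions, min_shared=2):
    """Group motifs into clouds of directly or indirectly overlapping motifs."""
    motifs = sorted(motif_positions)
    links = {m: set() for m in motifs}
    for a, b in combinations(motifs, 2):
        if overlaps(motif_positions[a], motif_positions[b], min_shared):
            links[a].add(b)
            links[b].add(a)
    clouds, seen = [], set()
    for motif in motifs:
        if motif in seen:
            continue
        stack, cloud = [motif], []
        while stack:  # depth-first walk over linked motifs
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            cloud.append(current)
            stack.extend(links[current] - seen)
        clouds.append(sorted(cloud))
    return clouds


# Hypothetical occurrences: P.LP and LP.K share the L and P at 12 and 13 of seqA.
positions = {
    "P.LP": {("seqA", 10), ("seqA", 12), ("seqA", 13)},
    "LP.K": {("seqA", 12), ("seqA", 13), ("seqA", 15)},
    "RG.D": {("seqB", 4), ("seqB", 5), ("seqB", 7)},
}
clouds = make_clouds(positions)
print(clouds)
```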

Additional Motif Occurrence Statistics, such as motif conservation, are handled by the rje_slimlist module. Please
see the documentation for this module for a full list of commandline options. These options are currently under
development for SLiMFinder and are not fully supported. See the SLiMFinder Manual for further details. Note that the
OccFilter *does* affect the motifs returned by SLiMBuild and thus the TEIRESIAS output (as does min. IC and min.
Support) but the overall Motif StatFilter *only* affects SLiMFinder output following SLiMChance calculations.

Secondary Functions:
The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.

The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets
by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final
datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.
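
A minimal Python 3 sketch of the shuffling idea follows. It assumes each dataset is already represented as a list of
UPCs and simply redistributes the pooled UPCs so that every dataset keeps its original number of clusters. The names
are hypothetical, and the sketch does not check for shared sequences or for relationships between UPCs from different
datasets, which is the caveat noted above.

```python
import random


def randomise_datasets(datasets, seed=None):
    """Shuffle UPCs among datasets, keeping the number of UPCs per dataset."""
    rng = random.Random(seed)
    pool = [upc for upcs in datasets.values() for upc in upcs]
    rng.shuffle(pool)
    shuffled, start = {}, 0
    for name, upcs in datasets.items():
        shuffled[name] = pool[start:start + len(upcs)]
        start += len(upcs)
    return shuffled


# Hypothetical datasets: each UPC is a tuple of protein names.
datasets = {
    "setA": [("P1", "P2"), ("P3",), ("P4",)],
    "setB": [("P5",), ("P6", "P7")],
}
random_sets = randomise_datasets(datasets, seed=1)
for name in sorted(random_sets):
    print(name, random_sets[name])
```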

Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
* seqin=FILE : Sequence file to search [None]
* batch=LIST : List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
* maxseq=X : Maximum number of sequences to process [500]
* maxupc=X : Maximum UPC size of dataset to process [0]
* sizesort=X : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
* walltime=X : Time in hours before program will abort search and exit [1.0]
* resfile=FILE : Main SLiMFinder results table [slimfinder.csv]
* resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [SLiMFinder/]
* buildpath=PATH : Alternative path to look for existing intermediate files [SLiMFinder/]
* force=T/F : Force re-running of BLAST, UPC generation and SLiMBuild [False]
* pickup=T/F : Pick-up from aborted batch run by identifying datasets in resfile using RunID [False]
* dna=T/F : Whether the sequences files are DNA rather than protein [False]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SLiMBuild Options I: Evolutionary Filtering ###
* efilter=T/F : Whether to use evolutionary filter [True]
* blastf=T/F : Use BLAST Complexity filter when determining relationships [True]
* blaste=X : BLAST e-value threshold for determining relationships [1e-4]
* altdis=FILE : Alternative all by all distance matrix for relationships [None]
* gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
* homcut=X : Max number of homologues to allow (to reduce large multi-domain families) [0]
* newupc=PATH : Look for alternative UPC file and calculate Significance using new clusters [None]

### SLiMBuild Options II: Input Masking ###
* masking=T/F : Master control switch to turn off all masking if False [True]
* dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [False]
* consmask=T/F : Whether to use relative conservation masking [False]
* ftmask=LIST : UniProt features to mask out [EM]
* imask=LIST : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
* compmask=X,Y : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
* casemask=X : Mask Upper or Lower case [None]
* motifmask=X : List (or file) of motifs to mask from input sequences []
* metmask=T/F : Masks the N-terminal M (can be useful if termini=T) [True]
* posmask=LIST : Masks list of position-specific aas, where list = pos1:aas,pos2:aas [2:A]
* aamask=LIST : Masks list of AAs from all sequences (reduces alphabet) []
* qregion=X,Y : Mask all but the region of the query from (and including) residue X to residue Y [0,-1]
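
To make the compmask=X,Y rule concrete, the Python 3 sketch below slides a window of Y residues along a sequence and
masks a residue wherever it occurs X+ times within a window. Which positions are replaced (here, only the repeated
residue itself) and the use of "X" as the mask character are assumptions for illustration; see the masking code in
the SLiMSuite libraries for the actual behaviour.

```python
from collections import Counter


def mask_low_complexity(sequence, x=5, y=8, mask="X"):
    """Mask residues that occur x+ times within any window of y consecutive residues."""
    masked = list(sequence)
    # Windows are read from the original sequence, so earlier masking does not affect later windows.
    for start in range(max(len(sequence) - y + 1, 1)):
        window = sequence[start:start + y]
        counts = Counter(window)
        frequent = {aa for aa, n in counts.items() if n >= x}
        for offset, aa in enumerate(window):
            if aa in frequent:
                masked[start + offset] = mask
    return "".join(masked)


print(mask_low_complexity("MSTAAAAAAKLPQRS"))  # six consecutive A residues
print(mask_low_complexity("MSTAAAAKLPQRS"))    # only four A residues
```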

### SLiMBuild Options III: Basic Motif Construction ###
* termini=T/F : Whether to add termini characters (^ & $) to search sequences [True]
* minwild=X : Minimum number of consecutive wildcard positions to allow [0]
* maxwild=X : Maximum number of consecutive wildcard positions to allow [2]
* slimlen=X : Maximum length of SLiMs to return (no. non-wildcard positions) [5]
* minocc=X : Minimum number of unrelated occurrences for returned SLiMs. (Proportion of UP if < 1) [0.05]
* absmin=X : Used if minocc<1 to define absolute min. UP occ [3]
* alphahelix=T/F : Special i, i+3/4, i+7 motif discovery [False]
* fixlen=T/F : If true, will use maxwild and slimlen to define a fixed total motif length [False]
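
The sketch below (Python 3) illustrates how minwild and maxwild bound the dimers that seed motif construction, and how
minocc and absmin might combine into a support threshold. The nested {Ai:{X:{Aj:[occurrences]}}} layout follows the
Dimers dictionary described for the SLiMFinder class, but the function names, the skipping of masked "X" residues and
the rounding up of proportional minocc values are assumptions, not the released implementation.

```python
import math


def make_dimers(sequences, minwild=0, maxwild=2):
    """Index residue pairs separated by minwild..maxwild wildcard positions."""
    dimers = {}
    for name, seq in sequences.items():
        for i, ai in enumerate(seq):
            for x in range(minwild, maxwild + 1):
                j = i + x + 1
                if j >= len(seq):
                    break
                aj = seq[j]
                if "X" in (ai, aj):  # assumed: masked residues never start or end a dimer
                    continue
                dimers.setdefault(ai, {}).setdefault(x, {}).setdefault(aj, []).append((name, i))
    return dimers


def effective_minocc(minocc, absmin, upnum):
    """Support threshold: a count if minocc >= 1, else a proportion of UPCs (at least absmin)."""
    if minocc >= 1:
        return int(minocc)
    return max(absmin, int(math.ceil(minocc * upnum)))


dimers = make_dimers({"seq1": "PALP"})
print(dimers["P"])
print(effective_minocc(0.25, 3, 8), effective_minocc(0.5, 3, 20))
```

With maxwild=2 the first P of PALP pairs with A (no wildcards), L (one wildcard) and the final P (two wildcards).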

### SLiMBuild Options IV: Ambiguity ###
* preamb=T/F : Whether to search for ambiguous motifs during motif discovery [True]
* ambocc=X : Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or > minocc) [0.05]
* absminamb=X : Used if ambocc<1 to define absolute min. UP occ [2]
* equiv=LIST : List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMVF,FYW,FYH,KRH,DE,ST]
* wildvar=T/F : Whether to allow variable length wildcards [True]
* combamb=T/F : Whether to search for combined amino acid degeneracy and variable wildcards [False]

### SLiMBuild Options V: Advanced Motif Filtering ###
* altupc=PATH : Look for alternative UPC file and filter based on minocc [None]
* musthave=LIST : Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
* query=LIST : Return only SLiMs that occur in 1+ Query sequences (Name/AccNum) []
* focus=FILE : FILE containing focal groups for SLiM return (see Manual for details) [None]
* focusocc=X : Motif must appear in X+ focus groups (0 = all) [0]
* See also rje_slimcalc options for occurrence-based calculations and filtering *

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SLiMChance Options ###
* slimchance=T/F : Execute main SLiMFinder probability method and outputs [True]
* sigprime=T/F : Calculate more precise (but more computationally intensive) statistical model [False]
* sigv=T/F : Use the more precise (but more computationally intensive) fix to mean UPC probability [False]
* dimfreq=T/F : Whether to use dimer masking pattern to adjust number of possible sites for motif [True]
* probcut=X : Probability cut-off for returned motifs [0.1]
* maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [True]
* aafreq=FILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
* aadimerfreq=FILE : Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) (!!!Experimental!!!) [None]
* negatives=FILE : Multiply raw probabilities by under-representation in FILE (!!!Experimental!!!) [None]
* smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
* seqocc=T/F : Whether to upweight for multiple occurrences in same sequence (heuristic) [False]
* probscore=X : Score to be used for probability cut-off and ranking (Prob/Sig/S/R) [Sig]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Advanced Output Options I: Output data ###
* clouds=X : Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc / -1=off) [2]
* runid=X : Run ID for resfile (allows multiple runs on same data) [DATE:TIME]
* logmask=T/F : Whether to log the masking of individual sequences [True]
* slimcheck=FILE : Motif file/list to add to resfile output []

### Advanced Output Options II: Output formats ###
* teiresias=T/F : Replace TEIRESIAS, making *.out and *.mask.fasta files [False]
* slimdisc=T/F : Emulate SLiMDisc output format (*.rank & *.dat.rank + TEIRESIAS *.out & *.fasta) [False]
* extras=X : Whether to generate additional output files (alignments etc.) [1]
- 0 = No output beyond main results file
- 1 = Generate occurrence file, alignments and cloud file
- 2 = Generate all additional SLiMFinder outputs
- 3 = Generate SLiMDisc emulation too (equiv extras=2 slimdisc=T)
* targz=T/F : Whether to tar and zip dataset result files (UNIX only) [False]
* savespace=X : Delete "unnecessary" files following run (best used with targz): [0]
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle and *.occ.csv files
- 2 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)

### Advanced Output Options III: Additional Motif Filtering ###
* topranks=X : Will only output top X motifs meeting probcut [1000]
* oldscores=T/F : Whether to also output old SLiMDisc score (S) and SLiMPickings score (R) [False]
* allsig=T/F : Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [False]
* minic=X : Minimum information content for returned motifs [2.1]
* See also rje_slimcalc options for occurrence-based calculations and filtering *

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Additional Functions I: MotifSeq ###
* motifseq=LIST : Outputs fasta files for a list of X:Y, where X is the pattern and Y is the output file []
* slimbuild=T/F : Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [True]

### Additional Functions II: Randomised datasets ###
* randomise=T/F : Randomise UPC within batch files and output new datasets [False]
* randir=PATH : Output path for creation of randomised datasets [Random/]
* randbase=X : Base for random dataset name [rand]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
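
Options are given on the commandline as option=value pairs, with T/F for booleans and comma-separated values for
lists. The Python 3 helper below is a hypothetical convenience for assembling such a command from a script; the script
name slimfinder.py and the input file example.fas are assumptions for the example and are not confirmed by this
ReadMe.

```python
import sys


def build_command(script, **options):
    """Return an argument list of option=value pairs for a SLiMSuite-style command."""
    args = [sys.executable, script]
    for key, value in sorted(options.items()):
        if isinstance(value, bool):
            value = "T" if value else "F"  # booleans use the T/F convention
        elif isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        args.append("%s=%s" % (key, value))
    return args


cmd = build_command("slimfinder.py", seqin="example.fas", maxwild=2,
                    dismask=True, compmask=(5, 8))
print(" ".join(cmd[1:]))
```

The resulting list could be passed to subprocess.call(cmd) once the script path points at a real installation.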

Uses general modules: copy, glob, math, os, string, sys, time
Uses RJE modules: rje, rje_blast, rje_slim, rje_slimlist, rje_slimcalc, rje_slimcore, rje_dismatrix_V2, rje_seq, rje_scoring
Other modules needed: None

slimfinder Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Preliminary working version with Poisson probabilities
    # 1.1 - Binomial probabilities, bonferroni corrections and complexity masking
    # 1.2 - Added musthave=LIST option and denferroni correction.
    # 1.3 - Added resfile=FILE output
    # 1.4 - Added option for termini
    # 1.5 - Reworked slim mechanics to be ai-x-aj strings for future ambiguity (split on '-' to make list)
    # 1.6 - Added basic ambiguity and flexible wildcards plus MST weighting for UP clusters
    # 1.7 - Added counting of generic dimer frequencies for improved Bonferroni and probability calculation (No blockmask.)
    #     - Added topranks=X and query=X
    # 1.8 - Added *.upc rather than *.self.blast. Added basic randomiser function.
    # 1.9 - Added MotifList object to handle extra calculations and occurrence filtering.
    # 2.0 - Tidied up and standardised output. Implemented extra filtering and scoring options.
    # 2.1 - Changed defaults. Removed poisson as option and other obsolete functions.
    # 2.2 - Tidied and reorganised code using SLiMBuild/SLiMChance subdivision of labour. Removed rerun=T/F (just Force.)
    # 2.3 - Added AAFreq "smear" and "better" p1+ calculation. Added extra cloud summary output.
    # 2.4 - Minor bug fixes and tidying. Removed power output. (Rubbish anyway!) Can read UPC from distance matrix.
    # 3.0 - Dumped useless stats and calculations. Simplified output. Improved ambiguity & clouds.
    # 3.1 - Added minwild and alphahelix options. (Partial aadimerfreq & negatives)
    # 3.2 - Tidied up with SLiMCore, replaced old Motif objects with SLiM objects and SLiMCalc.
    # 3.3 - Added XGMML output. Added webserver option with additional output.
    # 3.4 - Added consmask relative conservation masking.
    # 3.5 - Standardised masking options. Add motifmask and motifcull.
    # 3.6 - Added aamasking and alphabet.
    # 3.7 - Added option to switch off dimfreq and better handling of given aafreq
    # 3.8 - Added SLiMDisc & SLiMPickings scores and options to rank on them.
    # 3.9 - Added clouding consensus information. [Aborted due to technical challenges.]
    # 3.10- Added differentiation of methods for pickling and tarring.
    # 4.0 - Added SigPrime and SigV calculation from Norman. Added graded extras output.
    # 4.1 - Added SizeSort, AltUPC and NewUPC options. Added #END output for webserver.
    # 4.2 - Added fixlen option and improved Alphahelix option

SLiMFinder Class

    SLiMFinder Class. Author: Rich Edwards (2007).

    Info:str
    - AAFreq = Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
    - AddQuery = Adds query sequence(s) to batch jobs from FILE [None]  (QSLiMFinder only)
    - AltDis = Alternative all by all distance matrix for relationships [None]
    - AltUPC = Look for alternative UPC file and filter based on minocc [None]
    - Build = String giving summary of key SLiMBuild options
    - BuildPath = Alternative path to look for existing intermediate files [SLiMFinder/]
    - CaseMask = Mask Upper or Lower case [None]
    - Chance = String giving summary of key SLiMChance options
    - CompMask = Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
    - Focus = FILE containing focal groups for SLiM return (see Manual for details) [None]
    - GablamDis = Alternative GABLAM results file [None]
    - Input = Original name (and path) of input file
    - AADimerFreq = Use empirical dimer frequencies from FILE (fasta or *.dimer.tdt) [None]
    - Negatives = Multiply raw probabilities by under-representation in FILE [None]
    - NewUPC = Look for alternative UPC file and calculate Significance using new clusters [None]
    - ProbScore = Score to be used for probability cut-off and ranking (Prob/Sig) [Sig]
    - RanDir = Output path for creation of randomised datasets [./]
    - Randbase = Base for random dataset name [rand_]
    - ResDir = Redirect individual output files to specified directory [SLiMFinder/]
    - ResFile = If FILE is given, will also produce a table of results in resfile [slimfinder.csv]
    - RunID = Run ID for resfile (allows multiple runs on same data) [DATE:TIME]
    - SlimCheck = Motif file/list to add to resfile output []
    
    Opt:boolean
    - AllSig = Whether to also output all SLiMChance combinations (Sig/SigV/SigPrime/SigPrimeV) [False]
    - AlphaHelix = Special i, i+3/4, i+7 motif discovery [False]
    - CombAmb = Whether to search for combined amino acid degeneracy and variable wildcards [True]
    - DimFreq = Whether to use dimer masking pattern to adjust number of possible sites for motif [True]
    - DisMask = Whether to mask ordered regions (see rje_disorder for options) [False]
    - DNA = Whether the sequences files are DNA rather than protein [False]
    - Force = whether to force recreation of key files [False]
    - EFilter = Whether to use evolutionary filter [True]
    - Extras = Whether to generate additional output files (alignments etc.) [False]
    - FixLen = If true, will use maxwild and slimlen to define a fixed total motif length  [False]
    - LogMask = Whether to log the masking of individual sequences [True]
    - OldScores = Whether to also output old SLiMDisc score (S) and SLiMPickings score (R) [False]
    - Masked = Whether dataset has been masked [False]
    - Masking = Master control switch to turn off all masking if False [True]
    - MaskM = Masks the N-terminal M (can be useful if termini=T) [False]
    - PreAmb = Whether to search for ambiguous motifs during motif discovery [False]
    - OccStatsCalculated = Whether OccStats have been calculated for all occurrence [False]
    - MaskFreq = Whether to mask input before any analysis, or after frequency calculations [True]
    - SigV = Use the more precise (but more computationally intensive) fix to mean UPC probability [False]
    - Randomise = Randomise UPC within batch files [False]
    - SeqOcc = Whether to upweight for multiple occurrences in same sequence [True]
    - SigPrime = Calculate more precise (but more computationally intensive) statistical model [False]
    - SlimBuild = Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [True]
    - SlimDisc = Output in SLiMDisc format (*.rank & *.dat.rank) [False]
    - SlimChance = Output in SLiMFinder Format (*.rank & *.occ.txt) [True]
    - SmearFreq = Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
    - TarGZ = Whether to tar and zip dataset result files (UNIX only) [False]
    - Teiresias = Replace TEIRESIAS only, making *.out and *.mask.fas files [False]
    - Termini = Whether to add termini characters (^ & $) to search sequences [True]
    - Test = Special Test parameter for experimentation with code [False]
    - WildVar = Whether to allow variable length wildcards [False]
    - Pickup = Pick-up from aborted batch run by identifying last dataset output in resfile [False]
    - Webserver = Generate additional webserver-specific output [False]
    - TempMaxSetting = Whether Maxupc and Maxseq are temporary and skipped datasets should be left out of results

    Stat:numeric
    - AAMask = Masks list of AAs from all sequences (reduces alphabet) []
    - Alphabet = List of characters to include in search (e.g. AAs or NTs) 
    - AbsMin = Used if minocc<1 to define absolute min. UP occ [3]
    - AbsMinAmb = Used if ambocc<1 to define absolute min. UP occ [2]    
    - AmbOcc = Min. UP occurrence for subvariants of ambiguous motifs (minocc if 0 or > minocc) [0]
    - Clouds = Identifies motif "clouds" which overlap at 2+ positions in X+ sequences (0=minocc) [2]
    - Extras = Whether to generate additional output files (alignments etc.) [1]
    - FocusOcc = Motif must appear in X+ focus groups (0 = all) [0]
    - Minwild = Minimum number of consecutive wildcard positions to allow [0]
    - MaxWild = Maximum number of consecutive wildcard positions to allow [3]
    - SlimLen = Maximum length of SLiMs to return (no. non-wildcard positions) [5]
    - MaxSeq = Maximum number of sequences to process [500]
    - MaxUPC = Maximum UPC size of dataset to process [0]
    - MinIC = Minimum information content for returned motifs [1.1]
    - MinOcc = Minimum number of unrelated occurrences for returned SLiMs [2]
    - MST = MST corrected size for whole dataset
    - ProbCut = Probability cut-off for returned motifs [0.01]
    - SaveSpace = Delete "unnecessary" files following run (see Manual for details) [0]
    - SizeSort = Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
    - StartTime = Starting time in seconds (for output when using shared log file)
    - TopRanks = Will only output top X motifs meeting probcut [0]
    - WallTime = Time in hours before program will terminate [1.0]

    List:list
    - Batch = List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
    - Equiv = List (or file) of TEIRESIAS-style ambiguities to use [AGS,ILMV,FYW,KRH,DE,ST]
    - FTMask = UniProt features to mask out [EM,DOMAIN,TRANSMEM]
    - Headers = Headers for main SLiMFinder output table
    - HomCut = Max number of homologues to allow (to reduce large multi-domain families) [0]
    - IMask = UniProt features to inversely ("inclusively") mask [IM]
    - MustHave = Returned motifs must contain one or more of the AAs in LIST (reduces search space) []
    - OccScale = Rescale probabilities according to occstats (see manual for details) []
    - Query = Return only SLiMs that occur in 1+ Query sequences (Name/AccNum) []
    - SigSlim = List of significant SLiMs - matches keys to self.dict['Slim(Freq)'] - *in rank order*
    - SlimCheckExtra = List of extra SLiMs from SLiMCheck - added to extra outputs
    - UP = List of UP cluster tuples
    - Warning = List of text (log) warnings to reproduce at end of run
     
    Dict:dictionary
    - AADimerFreq = Empirical dimer frequencies {i:{x:{j:freq}}}
    - AAFreq = AA frequency dictionary for each seq / UPC
    - Clouds = Dictionary of Motif Clouds made by the makeClouds() method {ID:{stats}}
    - DimFreq = Frequency of dimers of each X length per upc {upc:[freqs]}
    - Dimers = main nested dictionary for SLiMFinder {Ai:{X:{Aj:{'UP':[UPC],'Occ':[(Seq,Pos)]}}}}
    - ElementIC = dictionary of {Motif Element:Information Content}
    - Focus = Dictionary of {focal group:list of seqs}
    - Extremf. = Dictionary of {length:extremferroni correction}
    - MaskPos = Masks list of position-specific aas, where list = pos1:aas,pos2:aas  [2:A]
    - MotifSeq = Dictionary of {pattern:output file for sequences}
    - MST = MST corrected size for UPC {UPC:MST}
    - Slim = main dictionary containing SLiMs with enough support {Slim:{'UPC':[UPC],'Occ':[Seq,Pos]}}
    - SeqOcc = dictionary of {Slim:{Seq:Count}} 

    Obj:RJE_Objects
    - SlimCheck = MotifList object handling motifs to check
    - SeqList = main SeqList object containing dataset to be searched
    - SlimList = SLiMList object handling motif stats and filtering options
SLiMFinder._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SLiMFinder._remakeSLiMDictUP(self)
    Remakes SLiM UP dictionaries from SLiM Occ.
SLiMFinder._setAttributes(self)
    Sets Attributes of Object.
SLiMFinder.aaDp1(self,slim,upc)
    Set up parameters for p1+ binomial.
SLiMFinder.addAmb(self,prevslim,equivlist,type='Combined')
    Combines SLiMs from prevslim list into ambiguous SLiMs.
SLiMFinder.adjustAATotals(self)
    will be exactly as they were. If maskfreq=F, however, frequencies will need to be adjusted for the new number of
    masked and non-masked amino acids.
    try:
        ### Simple PreMasking Procedure ###
        if self.opt['MaskFreq']:
            for upc in self.list['UP']:
                if 'X' in self.dict['AAFreq'][upc]: self.dict['AAFreq'][upc].pop('X') # Ignore Xs
                if self.opt['DNA'] and 'N' in self.dict['AAFreq'][upc]: self.dict['AAFreq'][upc].pop('N') # Ignore Ns from DNA sequences
                self.dict['AAFreq'][upc].pop('Total')  #!# Total remade by dictFreq #!#
                if self.opt['Termini']:
                    self.dict['AAFreq'][upc]['^'] = len(upc)
                    self.dict['AAFreq'][upc]['$'] = len(upc)
                rje.dictFreq(self.dict['AAFreq'][upc])
                ## Make MST adjustments for UPC ##
                self.dict['AAFreq'][upc]['Total'] = int(0.5+(self.dict['AAFreq'][upc]['Total']*self.dict['MST'][upc]))
            if self.opt['SmearFreq']: self.smearAAFreq()
            return

        ### More complicated PostMasking Procedure ###
        (prex,postx) = (0,0)
        for upc in self.list['UP']:
            x = 0
            if 'X' in self.dict['AAFreq'][upc]: x = self.dict['AAFreq'][upc].pop('X') # Ignore Xs
            if self.opt['DNA'] and 'N' in self.dict['AAFreq'][upc]: x += self.dict['AAFreq'][upc].pop('N') # Ignore Ns from DNA sequences
            totalaa = self.dict['AAFreq'][upc].pop('Total')
            preaa = totalaa - x   # Want to calculate new total
            prex += preaa
            nonx = 0.0
            for seq in upc: nonx += seq.aaLen() - seq.info['Sequence'].count('X')
            rje.dictFreq(self.dict['AAFreq'][upc])
            ## Termini ##
            if self.opt['Termini']:
                nonx += 2 * len(upc)
                self.dict['AAFreq'][upc]['^'] = len(upc) / nonx
                self.dict['AAFreq'][upc]['$'] = len(upc) / nonx
            ## Make MST adjustments for UPC ##
            self.dict['AAFreq'][upc]['Total'] = int(0.5+(nonx*self.dict['MST'][upc]))
            postx += self.dict['AAFreq'][upc]['Total']

        ### Finish ###
        if self.info['AAFreq'].lower() not in ['','none']: prex = self.dict['AAFreq']['Dataset']['Total']
        if self.opt['DNA']: self.printLog('#ADJ','Effective dataset size reduced from %s nt to %s nt' % (rje.integerString(prex),rje.integerString(postx)))
        else: self.printLog('#ADJ','Effective dataset size reduced from %s AA to %s AA' % (rje.integerString(prex),rje.integerString(postx)))
        if self.opt['SmearFreq']: self.smearAAFreq()
    except:
        self.log.errorLog('Problem during %s.adjustAATotals()' % self.prog())
        raise
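
    The pre-masking branch of adjustAATotals can be followed on a plain dictionary with the Python 3 sketch below.
    The dict_freq helper is a stand-in written for this example: it assumes that rje.dictFreq converts counts to
    frequencies in place and recreates the 'Total' entry, as the source comment "Total remade by dictFreq" suggests.
    All names and numbers here are hypothetical.

```python
def dict_freq(counts):
    """Stand-in for rje.dictFreq (assumed behaviour): counts -> frequencies, plus 'Total'."""
    total = float(sum(counts.values()))
    for key in counts:
        counts[key] = counts[key] / total if total else 0.0
    counts["Total"] = total
    return counts


def adjust_upc_freq(counts, n_seqs, mst, termini=True):
    """Drop masked residues, add termini, normalise, then scale the total by the MST value."""
    counts = dict(counts)       # work on a copy
    counts.pop("X", None)       # ignore masked residues
    counts.pop("Total", None)   # the total is recalculated below
    if termini:
        counts["^"] = n_seqs    # one N-terminus and one C-terminus per sequence
        counts["$"] = n_seqs
    dict_freq(counts)
    counts["Total"] = int(0.5 + counts["Total"] * mst)  # round to nearest integer
    return counts


# Hypothetical UPC: one sequence, 8 unmasked residues, 3 masked, MST correction 0.5.
freq = adjust_upc_freq({"A": 6, "L": 2, "X": 3, "Total": 11}, n_seqs=1, mst=0.5)
print(freq)
```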
SLiMFinder.ambSLiM(self,prevslim)
    Combines SLiMs from prevslim list into ambiguous SLiMs.
SLiMFinder.backupOrCreateResFile(self)
    Backs up and/or creates main results file.
SLiMFinder.chanceText(self)
    Makes/Returns self.info['Chance'].
SLiMFinder.cloudConsensi(self)
    Generate consensus motifs from clouds.
SLiMFinder.cloudOutput(self)
    Generates motif cloud output.
SLiMFinder.cloudXGMML(self,xgmml)
    Converts full XGMML into Cloud XGMML.
SLiMFinder.combineAmb(self,occvar)
    Combines motif variants in occvar into an ambiguous motif, incorporating "missing" variants as appropriate.
SLiMFinder.extendSLiM(self,slim)
    Finds and returns extensions of SLiM with sufficient support.
    >> slim:str = SLiM to extend (using dimers)
SLiMFinder.extraOutput(self)
    Method controlling additional outputs (primarily MotifList alignments):
    - Full protein alignments (in subdirectory) with orthologues (no masking)
    - Protein Alignments with Motifs and masking marked
    - Motif alignments with and without masking
    - Occurrence data table
    - CompariMotif analysis?
SLiMFinder.filterSLiMs(self)
    Filters SLiMs on non-SLiMChance parameters.
SLiMFinder.focusAdjustment(self,slim)
    Adjust raw probabilities according to focus dictionary.
    >> slim:str = SLiM for probability adjustment
SLiMFinder.getSigSlim(self,pattern)
    Returns slimcode for given pattern (should be in SigSlim).
SLiMFinder.getSlimProb(self,slim)
    Returns appropriate SLiM Score given settings.
SLiMFinder.loadAADimerFreq(self)
    Calculates/loads AA Dimer Frequencies.
SLiMFinder.makeBonferroni(self)
    Calculates Bonferroni correction number for dataset.
SLiMFinder.makeClouds(self)
    Identifies "clouds" of motifs - similar patterns that overlap (from SigSlim).
SLiMFinder.makeDimers(self)
    Finds all possible dimers with wildcards, using MinWild/MaxWild stat.
SLiMFinder.makeSLiMs(self)
    Makes SLiMs with enough support from Dimers.
SLiMFinder.motifOccStats(self,slim)
    Calculates Motif OccStats and Filters Occ if appropriate. Returns Motif/None if slim still OK/Not.
SLiMFinder.mustHave(self,slim)
    Looks at SLiM w.r.t. MustHave list and returns True/False if OK or not.
SLiMFinder.myPickle(self)
    Returns pickle identifier, also used for Outputs "Build" column.
SLiMFinder.newBatchRun(self,infile)
    Returns SLiMFinder object for new batch run.
SLiMFinder.pickleMe(self,load=False)
    Saves pickle for later.
SLiMFinder.rankScore(self)
    Scores and Ranks Sig Motifs.
SLiMFinder.reduceDimers(self)
    Reduces Dimers to those with enough Support.
SLiMFinder.resHead(self)
    Returns main Output headers.
SLiMFinder.results(self,aborted=None)
    Main SLiMFinder Results Output.
SLiMFinder.run(self,batch=False)
    Main SLiMFinder Run Method:
    0. PreCheck:
        - Check for randomise function and execute if appropriate
    1. Input:
        - Read sequences into SeqList
        - or - Identify appropriate Batch datasets and rerun each with batch=True
    2. SLiMBuild:
        - Check for existing Pickle and load if found. Check appropriate parameter settings and re-run if desired.
        - or - Save sequences as fasta and mask sequences in SeqList
        -  Perform BLAST and generate UPC based on saved fasta.
        - Calculate AAFreq for each sequence and UPC.
        - Find all dimer motifs in dataset using MinWild/MaxWild parameters.
        - Extend to SLiMs and add ambiguity
    5. Identify significant SLiMs.
    6. Output results and tidy files.
    >> batch:bool [False] = whether this run is already an individual batch mode run.
SLiMFinder.setupFocus(self)
    Sets up Focus dictionary. Returns True if OK, else False (which cancels run).
SLiMFinder.setupMinOcc(self)
    Adjusts MinOcc and AmbOcc settings according to UPNum etc.
SLiMFinder.setupResults(self)
    Sets up Main Results File as well as StatFilters etc.
SLiMFinder.setupXGMML(self,xgmml=None)
    Adds proteins and UPC to XGMML (new one if no object given).
SLiMFinder.sigSlim(self,slim,calculate=True,rankfilter=True)
    Adds SLiM to self.list['SigSlim'] providing it meets requirements.
SLiMFinder.slimChance(self)
    SLiMChance Probability and Significance Calculations.
SLiMFinder.slimChanceNorman(self,fixes,replacesig=False)
    Calculates special corrected significance of Norman Davey.
SLiMFinder.slimCheck(self)
    Checks given list of Motifs and adds to output.
SLiMFinder.slimFocus(self,slim)
    Returns True if slim is in Focal sequence groups, else False.
SLiMFinder.slimNum(self)


SLiMFinder.slimOccNum(self,slim,upc=None)
    Returns number of occ of Slim in given UPC.
SLiMFinder.slimProb(self,slim)
    Calculate Probabilities for given SLiM.
SLiMFinder.slimSeqNum(self,slim)
    Returns the number of sequences SLiM occurs in.
SLiMFinder.slimUP(self,slim)


SLiMFinder.teiresias(self)
    Output in TEIRESIAS format along with masked sequence (input) fasta file.
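
As a rough illustration of the makeDimers and reduceDimers steps listed above, the sketch below enumerates residue pairs separated by a variable number of wildcard positions and keeps those found in enough sequences. The function name find_dimers, its min_wild/max_wild parameters and the (aa, wildcards, aa) tuple representation are assumptions made for this example; they are not taken from the SLiMFinder source, which also works with UPC rather than raw sequence counts.

```python
from collections import defaultdict

def find_dimers(sequences, min_wild=0, max_wild=2, masked='X'):
    """Map each (aa1, wildcards, aa2) dimer to the set of sequence indices containing it."""
    dimers = defaultdict(set)
    for idx, seq in enumerate(sequences):
        for i, aa1 in enumerate(seq):
            if aa1 == masked:
                continue  # masked positions cannot start a dimer
            for gap in range(min_wild, max_wild + 1):
                j = i + gap + 1
                if j >= len(seq):
                    break  # longer gaps would also run off the sequence end
                aa2 = seq[j]
                if aa2 != masked:
                    dimers[(aa1, gap, aa2)].add(idx)
    return dimers

# Keep only dimers supported by at least two sequences (hypothetical threshold).
dimers = find_dimers(['PLAIE', 'PXXIE'], min_wild=0, max_wild=2)
support = {d: len(s) for d, s in dimers.items() if len(s) >= 2}
```

In this toy dataset only the dimers I-E (no wildcards) and P..I (two wildcards) occur in both sequences, so only those two would be passed on for extension.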

slimfinder Module Methods

slimfinder.patternFromCode(slim)


slimfinder.runMain()


slimfinder.slimDif(slim1,slim2)


slimfinder.slimLen(slim)


slimfinder.slimPos(slim)




slimfinder Module ToDo Wishlist

    # [ ] : Interactive user Menu
    # [ ] : Add iterative search mode that re-searches with remaining sequences
    # [Y] : Improve the use of the pickling - check any options that might change output etc.
    # [?] : Add generation of data summary from Log file (from rje_misc) and tidy resfile=FILE output
    # [Y] : Incorporate MotifList object for ease of use of methods (e.g. IC) and future conservation etc.?
    # [Y] : Check and define when SeqOcc is used - used for calculating "Support" column
    # [ ] : Consider adding SlimCheck SLiMs to the alignment outputs - will need to find occurrences with PRESTO.
    # [ ] : Consider adding an MST file for the Focus Group that can be read like the UPC
    # [x] : Add gablam=FILE option to read gablam results from a file (and generate this file first if missing).
    # [Y] : Replace the rje_motif Expectation calculation for p1+ with an internal calculation.
    # [Y] : Add the number of sequences (absolute support) to output
    # [?] : Add positional variation from Nterm and Cterm to output? (Can calculate from Occ File)
    # [Y] : Add an option to "smear" aa frequencies over all UPC & output masked sequences in masked.fas (extras=T).
    # [Y] : Add gzipping of pickles in UNIX.
    # [Y] : Remove Normferroni and Occferroni measures. (Crap!)
    # [Y] : Make improved use of Query and Focus groups - binomial-based adjustment rather than reduced dataset size.
    # [ ] : Add line to output if walltime reached.
    # [ ] : Code-up negatives option - use MotifSeq as base?
    # [~] : Move stats and filtering on stats out into slimcalc and correctly use SLiMList.
    # [ ] : Add occscale=LIST : Rescale probabilities according to occstats []
    # [ ] : Add palindrome option?
    # [ ] : Add output compatible with Norman's motif drawing thingamie.
    # [ ] : Check what motifcull is and whether it is implemented.
    # [?] : Fix the bug that outputs some occurrences twice. (Variable wildcards only.)

SLiMPred [version 3.0] Short Linear Motif Prediction tool ~ [Top]

Program: SLiMPred
Description: Short Linear Motif Prediction tool
Version: 3.0
Last Edit: 29/11/11
Imports:
Imported By: None
Copyright © 2011 Niall J. Haslam - See source code for GNU License Notice

slimpred Module Methods

slimpred.SLiMPred(infile, outfile, threshold)


slimpred.readIniFile(iniFile, iniValues)


slimpred.setInFile(inFile, slimPiniValues)


slimpred.setOutFile(outfile, slimPiniValues)


slimpred.setThreshold(threshold, slimPiniValues)




SLiMPrints [version 3.0] Short Linear Motif fingerPrints ~ (slimprints3.py)[Top]

Program: SLiMPrints
Description: Short Linear Motif fingerPrints
Version: 3.0
Last Edit: 29/11/11
Imports: ned_SLiMPrints, ned_SLiMPrints_Tester, ned_commandLine, ned_proteinInfoHelper, ned_conservationScorer, ned_basic
Imported By: None
Copyright © 2011 Norman E. Davey - See source code for GNU License Notice

SLiMSearch [version 1.5] Short Linear Motif Search tool ~ [Top]

Program: SLiMSearch
Description: Short Linear Motif Search tool
Version: 1.5
Last Edit: 03/06/10
Imports: rje, rje_blast, rje_seq, rje_sequence, rje_scoring, rje_slim, rje_slimcore, rje_slimcalc, rje_slimlist, rje_zen, rje_dismatrix_V2
Imported By: None
Citation: Davey, Haslam, Shields & Edwards (2010), Lecture Notes in Bioinformatics 6282: 50-61.
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
SLiMSearch is a tool for finding pre-defined SLiMs (Short Linear Motifs) in a protein sequence database. SLiMSearch
can make use of corrections for evolutionary relationships and a variation of the SLiMChance algorithm from
SLiMFinder to assess motifs for statistical over- and under-representation. SLiMSearch is a replacement for PRESTO
and uses many of the same underlying modules.

Benefits of SLiMSearch that make it more useful than a lot of existing tools include:
* searching with mismatches rather than restricting hits to perfect matches.
* optional equivalency files for searching with specific allowed mismatches (e.g. charge conservation)
* generation or reading of alignment files from which to calculate conservation statistics for motif occurrences.
* additional statistics, including protein disorder, surface accessibility and hydrophobicity predictions
* recognition of "n of m" motif elements in the form <X:n:m>, where X is one or more amino acids that must occur n+
times across m positions. E.g. <IL:3:5> must have 3+ Is and/or Ls in a 5aa stretch.

Main output for SLiMSearch is a delimited file of motif/peptide occurrences but the motifaln=T and proteinaln=T options
also allow output of alignments of motifs and their occurrences.
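
The "n of m" element described in the list above can be pictured as a sliding-window count. The following is a minimal sketch only: the function names and the 0-based positions are assumptions for illustration and do not reflect how SLiMSearch implements the search internally.

```python
def n_of_m_match(window, residues, n):
    """Return True if at least n positions of window are drawn from residues."""
    return sum(1 for aa in window if aa in residues) >= n

def find_n_of_m(sequence, residues, n, m):
    """Return 0-based start positions of every m-residue window with n+ matching residues."""
    return [i for i in range(len(sequence) - m + 1)
            if n_of_m_match(sequence[i:i + m], residues, n)]

# 3+ Is and/or Ls in a 5aa stretch, as in the example in the text.
hits = find_n_of_m('MKILALGSTP', 'IL', 3, 5)
```

Here the windows KILAL and ILALG (start positions 1 and 2) each contain three I/L residues, so both are reported; overlapping hits like these would normally need merging or ranking downstream.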

Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
* motifs=FILE : File of input motifs/peptides [None]
Single line per motif format = 'Name Sequence #Comments' (Comments are optional and ignored)
Alternative formats include fasta, SLiMDisc output and raw motif lists.
* seqin=FILE : Sequence file to search [None]
* batch=LIST : List of sequence files for batch input (wildcard * permitted) []
* maxseq=X : Maximum number of sequences to process [0]
* maxsize=X : Maximum dataset size to process in AA (or NT) [100,000]
* maxocc=X : Filter out Motifs with more than maximum number of occurrences [0]
* walltime=X : Time in hours before program will abort search and exit [1.0]
* resfile=FILE : Main SLiMSearch results table [slimsearch.csv]
* resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [SLiMSearch/]
* buildpath=PATH : Alternative path to look for existing intermediate files [SLiMSearch/]
* force=T/F : Force re-running of BLAST, UPC generation and search [False]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SearchDB Options I: Input Protein Sequence Masking ###
* masking=T/F : Master control switch to turn off all masking if False [False]
* dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [False]
* consmask=T/F : Whether to use relative conservation masking [False]
* ftmask=LIST : UniProt features to mask out [EM,DOMAIN,TRANSMEM]
* imask=LIST : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
* compmask=X,Y : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
* casemask=X : Mask Upper or Lower case [None]
* motifmask=X : List (or file) of motifs to mask from input sequences []
* metmask=T/F : Masks the N-terminal M [False]
* posmask=LIST : Masks list of position-specific aas, where list = pos1:aas,pos2:aas [2:A]
* aamask=LIST : Masks list of AAs from all sequences (reduces alphabet) []
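
To make the compmask=X,Y rule above concrete, here is a small sketch of one way a "same AA in X+ of Y consecutive aas" mask could work. The function name is hypothetical, and the choice to mask from the first to the last occurrence of the repeated residue within each window is an assumption; the real masking code may treat the window differently.

```python
from collections import Counter

def mask_low_complexity(sequence, x=5, y=8, mask='X'):
    """Mask runs where one residue fills x or more of y consecutive positions."""
    masked = list(sequence)
    for i in range(len(sequence) - y + 1):
        window = sequence[i:i + y]          # windows are read from the unmasked input
        aa, count = Counter(window).most_common(1)[0]
        if aa == mask or count < x:
            continue
        # Assumption: mask from first to last occurrence of the repeated residue.
        first, last = window.index(aa), window.rindex(aa)
        for pos in range(i + first, i + last + 1):
            masked[pos] = mask
    return ''.join(masked)

masked_seq = mask_low_complexity('MASSSSSTSLKDE', x=5, y=8)
```

With the default 5,8 setting the serine-rich stretch is replaced by X while the flanking residues are kept, giving MAXXXXXXXLKDE for this input.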

### SearchDB Options II: Evolutionary Filtering ###
* efilter=T/F : Whether to use evolutionary filter [False]
* blastf=T/F : Use BLAST Complexity filter when determining relationships [True]
* blaste=X : BLAST e-value threshold for determining relationships [1e-4]
* altdis=FILE : Alternative all by all distance matrix for relationships [None]
* gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SLiMChance Options ###
* maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [True]
* aafreq=FILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
* aadimerfreq=FILE: Use empirical dimer frequencies from FILE (fasta or *.aadimer.tdt) [None]
* negatives=FILE : Multiply raw probabilities by under-representation in FILE [None]
* background=FILE : Use observed support in background file for over-representation calculations [None]
* smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
* seqocc=X : Restrict to sequences with X+ occurrences (adjust for high frequency SLiMs) [1]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Output Options ###
* extras=T/F : Whether to generate additional output files (alignments etc.) [True]
* pickle=T/F : Whether to save/use pickles [True]
* targz=T/F : Whether to tar and zip dataset result files (UNIX only) [False]
* savespace=0 : Delete "unnecessary" files following run (best used with targz): [0]
- 0 = Delete no files
- 1 = Delete all bar *.upc and *.pickle files
- 2 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
* See also rje_slimcalc options for occurrence-based calculations and filtering *
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje
Other modules needed: None

slimsearch Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Standardised masking options. Still not fully tested.
    # 1.1 - Added background=FILE option for determining mean(p1+) for SLiMs based on background file.
    # 1.2 - Added maxsize option.
    # 1.3 - Add aamask option (and alphabet)
    # 1.4 - Fixed zero-size UPC bug.
    # 1.5 - Add MaxOcc setting.

SLiMSearch Class

    Short Linear Motif Regular Expression Search Tool Class. Author: Rich Edwards (2007). This module inherits the
    SLiMFinder class from SLiMFinder, which handles the masking of datasets and correcting for evolutionary relationships
    (if this is desired).

    Info:str
    - AAFreq = Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
    - AltDis = Alternative all by all distance matrix for relationships [None]
    - Background = Use observed support in background file for over-representation calculations [None]
    - BuildPath = Alternative path to look for existing intermediate files [SLiMSearch/]
    - CaseMask = Mask Upper or Lower case [None]
    - CompMask = Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
    - GablamDis = Alternative GABLAM results file [None]
    - Input = Original name (and path) of input file
    - ResDir = Redirect individual output files to specified directory [SLiMSearch/]
    - ResFile = If FILE is given, will also produce a table of results in resfile [slimsearch.csv]
    - RunID = Run ID for resfile (allows multiple runs on same data) [DATE:TIME]
    
    Opt:boolean
    - DisMask = Whether to mask ordered regions (see rje_disorder for options) [False]
    - Force = whether to force recreation of key files [False]
    - EFilter = Whether to use evolutionary filter [True]
    - Extras = Whether to generate additional output files (alignments etc.) [False]
    - LogMask = Whether to log the masking of individual sequences [True]
    - Masked = Whether dataset has been masked [False]
    - Masking = Master control switch to turn off all masking if False [True]
    - MaskM = Masks the N-terminal M (can be useful if termini=T) [False]
    - MaskFreq = Whether to mask input before any analysis, or after frequency calculations [True]
    - Pickle = Whether to save/use pickles [True]
    - SmearFreq = Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
    - TarGZ = Whether to tar and zip dataset result files (UNIX only) [False]
    - Teiresias = Replace TEIRESIAS only, making *.out and *.mask.fas files [False]

    Stat:numeric
    - MaxOcc = Filter out Motifs with more than maximum number of occurrences [0]
    - MaxSeq = Maximum number of sequences to process [500]
    - MaxSize = Maximum dataset size to process in AA (or NT) [1e5]
    - MST = MST corrected size for whole dataset
    - SaveSpace = Delete "unnecessary" files following run (see Manual for details) [0]
    - SeqOcc = Restrict to sequences with X+ occurrences (adjust for high frequency SLiMs) [1]
    - StartTime = Starting time in seconds (for output when using shared log file)
    - WallTime = Time in hours before program will terminate [1.0]

    List:list
    - AAMask = Masks list of AAs from all sequences (reduces alphabet) []
    - Alphabet = List of characters to include in search (e.g. AAs or NTs) 
    - Batch = List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
    - FTMask = UniProt features to mask out [EM,DOMAIN,TRANSMEM]
    - Headers = Headers for main output table
    - IMask = UniProt features to inversely ("inclusively") mask [IM]
    - NoHits = List of sequences without any motif hits []
    - OccHeaders = Headers for main occurrence output table
    - UP = List of UP cluster tuples
     
    Dict:dictionary
    - AAFreq = AA frequency dictionary for each seq / UPC
    - MaskPos = Masks list of position-specific aas, where list = pos1:aas,pos2:aas  [2:A]
    - MST = MST corrected size for UPC {UPC:MST}
    - P1 = Probabilities of 1+ occurrences {SLiM:{Seq/UPC:p}}

    Obj:RJE_Objects
    - Background = Background SLiMSearch object containing data for background occurrences
    - SeqList = main SeqList object containing dataset to be searched
    - SlimList = MotifList object handling motif stats and filtering options
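
The P1 dictionary above stores probabilities of 1+ occurrences in the nested form {SLiM:{Seq/UPC:p}}. As a hedged illustration of what such a value can look like, the sketch below uses the standard "at least one success" formula for independent positions. The function name, the per-position probabilities and the position counts are all invented for the example, and SLiMChance itself may compute these values differently.

```python
def prob_one_plus(p_site, n_sites):
    """Probability of 1+ occurrences across n_sites independent positions."""
    return 1.0 - (1.0 - p_site) ** n_sites

# Hypothetical layout mirroring the documented {SLiM:{Seq/UPC:p}} structure.
p1 = {
    'L.C.E': {
        'upc1': prob_one_plus(1e-4, 1000),  # larger cluster, more positions
        'upc2': prob_one_plus(1e-4, 250),   # smaller cluster, fewer positions
    }
}
```

The larger the effective number of positions, the closer the value gets to 1, which is one reason the dataset size is corrected (via MST) before probabilities are calculated.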
SLiMSearch._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SLiMSearch._setAttributes(self)
    Sets Attributes of Object.
SLiMSearch.adjustAATotals(self)
    will be exactly as they were. If maskfreq=F, however, frequencies will need to be adjusted for the new number of
    masked and non-masked amino acids.'''
    try:
        ### Simple PreMasking Procedure ###
        if self.opt['MaskFreq']:
            for upc in self.list['UP']:
                if self.dict['AAFreq'][upc].has_key('X'): self.dict['AAFreq'][upc].pop('X') # Ignore Xs
                self.dict['AAFreq'][upc].pop('Total')  #!# Total remade by dictFreq #!#
                self.dict['AAFreq'][upc]['^'] = 0
                self.dict['AAFreq'][upc]['$'] = 0
                rje.dictFreq(self.dict['AAFreq'][upc])
                ## Make MST adjustments for UPC ##
                self.dict['AAFreq'][upc]['Total'] = int(0.5+(self.dict['AAFreq'][upc]['Total']*self.dict['MST'][upc]))
                if self.dict['AAFreq'][upc]['Total']:
                    self.dict['AAFreq'][upc]['^'] = self.dict['MST'][upc] / float(self.dict['AAFreq'][upc]['Total'])
                    self.dict['AAFreq'][upc]['$'] = self.dict['MST'][upc] / float(self.dict['AAFreq'][upc]['Total'])
                else:
                    self.dict['AAFreq'][upc]['^'] = 0.5
                    self.dict['AAFreq'][upc]['$'] = 0.5
            if self.opt['SmearFreq']: self.smearAAFreq()
            return

        ### More complicated PostMasking Procedure ###
        (prex,postx) = (0,0)
        for upc in self.list['UP']:
            x = 0
            if self.dict['AAFreq'][upc].has_key('X'): x = self.dict['AAFreq'][upc].pop('X') # Ignore Xs
            totalaa = self.dict['AAFreq'][upc].pop('Total')
            preaa = totalaa - x   # Want to calculate new total
            prex += preaa
            nonx = 0.0
            for seq in upc: nonx += seq.nonX()
            ## Termini ##
            self.dict['AAFreq'][upc]['^'] = 0
            self.dict['AAFreq'][upc]['$'] = 0
            rje.dictFreq(self.dict['AAFreq'][upc])
            ## Make MST adjustments for UPC ##
            self.dict['AAFreq'][upc]['Total'] = int(0.5+(nonx*self.dict['MST'][upc]))
            postx += self.dict['AAFreq'][upc]['Total']
            ## Termini ##
            if nonx:
                self.dict['AAFreq'][upc]['^'] = self.dict['MST'][upc] / nonx
                self.dict['AAFreq'][upc]['$'] = self.dict['MST'][upc] / nonx
            else:
                self.dict['AAFreq'][upc]['^'] = 0.5
                self.dict['AAFreq'][upc]['$'] = 0.5

        ### Finish ###
        if self.info['AAFreq'].lower() not in ['','none']: prex = self.dict['AAFreq']['Dataset']['Total']
        if prex != postx: self.printLog('#ADJ','Effective dataset size reduced from %s %s to %s %s' % (rje.integerString(prex),self.units(),rje.integerString(postx),self.units()))
        if self.opt['SmearFreq']: self.smearAAFreq()
    except:
        self.log.errorLog('Problem during SLiMSearch.adjustAATotals()')
        raise
#########################################################################################################################
#CORE# def smearAAFreq(self):  ### Equalises AAFreq across UPC. Leaves Totals unchanged.
#########################################################################################################################
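
The arithmetic in the post-masking branch of adjustAATotals() above can be isolated into two small helpers for checking by hand. This is an illustrative sketch only: the function names and the input values are made up, while the expressions mirror the int(0.5+(nonx*MST)) total and the MST/nonx terminus frequency (with the 0.5 fallback) shown in the listing.

```python
def adjusted_total(nonx, mst):
    """Effective UPC size: nonx * mst rounded to the nearest integer via int(0.5 + value)."""
    return int(0.5 + nonx * mst)

def terminus_freq(nonx, mst):
    """Frequency given to each of '^' and '$'; falls back to 0.5 if nothing is unmasked."""
    if nonx:
        return mst / nonx
    return 0.5

# Hypothetical UPC: 900 unmasked residues and an MST correction of 1.5.
total = adjusted_total(900.0, 1.5)
term = terminus_freq(900.0, 1.5)
```

For these inputs the effective size is 1350 and each terminus receives a frequency of 1.5/900; a fully masked UPC (nonx of 0) yields a total of 0 and the fallback terminus frequency of 0.5.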
SLiMSearch.background(self)
    Sets up and/or returns Background SLiMSearch object.
SLiMSearch.backupOrCreateResFile(self)
    Backs up and/or creates main results file.
SLiMSearch.calculateOccAttributes(self)
    Executes rje_slimcalc calculations via rje_slimlist object.
SLiMSearch.combMotifOccStats(self)
    Combined occurrence stats for each Motif and summary output.
SLiMSearch.extraOutput(self)
    Method controlling additional outputs (primarily MotifList alignments):
    - Full protein alignments (in subdirectory) with orthologues (no masking)
    - Protein Alignments with Motifs and masking marked
    - Motif alignments with and without masking
SLiMSearch.myPickle(self)


SLiMSearch.newBatchRun(self,infile)
    Returns SLiMFinder object for new batch run.
SLiMSearch.pickleMe(self,load=False)
    Saves pickle for later.
SLiMSearch.processSeqOccs(self)
    Processes Occurrences after search - output to results file.
SLiMSearch.resHead(self,htype='Headers')
    Returns main Output headers.
SLiMSearch.run(self,batch=False)
    Main SLiMSearch Run Method:
    1. Input:
        - Read sequences into SeqList
        - or - Identify appropriate Batch datasets and rerun each with batch=True
    2. Generate UPC:
        - Check for existing UPC and load if found.
        - or - Save sequences as fasta and mask sequences in SeqList
        -  Perform BLAST and generate UPC based on saved fasta.
        - Calculate AAFreq for each sequence and UPC.
    3. Perform SLiM Search.
    4. Output results and tidy files.
    >> batch:bool [False] = whether this run is already an individual batch mode run.
SLiMSearch.searchDB(self)
    Main method for searching database with motifs. This just does basic matches. Stats etc. dealt with later.
SLiMSearch.setupBasefile(self)
    Sets up self.info['Basefile'].
SLiMSearch.setupResults(self)
    Sets up Main Results File as well as StatFilters etc.
SLiMSearch.slimIC(self,slim)


SLiMSearch.slimNum(self)


SLiMSearch.slimOccNum(self,slim)


SLiMSearch.slimProb(self,slim)
    Calculate Probabilities for given SLiM. Modified from slimcore for SLiMSearch.
SLiMSearch.slimProbBG(self,slim)
    Calculate Probabilities for given SLiM using Background Support.
SLiMSearch.slimSeqNum(self,slim)


SLiMSearch.slimUP(self,slim)
    Returns UP Num SLiM.
SLiMSearch.slims(self)


SLiMSearch.teiresias(self)
    Output in TEIRESIAS format along with masked sequence (input) fasta file.

slimsearch Module Methods

slimsearch.runMain()




slimsearch Module ToDo Wishlist

    # [ ] : Finish implementation with current methods/options!
    # [ ] : Add occurrence and SLiM filtering.
    # [ ] : Reinstate the expcut=X option for filtering motifs based on expected occurrences.
    # [ ] : Remove ORTHALN and make mapping like SLiMFinder
    # [ ] : Add "startfrom" option.

UniFake [version 1.2] Fake UniProt DAT File Generator ~ [Top]

Program: UniFake
Description: Fake UniProt DAT File Generator
Version: 1.2
Last Edit: 10/10/08
Imports: rje, rje_hmm, rje_seq, rje_tm, rje_uniprot, rje_zen
Imported By: rje_slimcore
Copyright © 2008 Richard J. Edwards - See source code for GNU License Notice

Function:
This program runs a number of in silico prediction programs and converts protein sequences into a fake UniProt DAT
flat file. Additional features may be given as one or more tables, using the features=LIST option. Please see the
UniFake Manual for more details.

Commandline:
### ~ INPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
* seqin=FILE : Input sequence file. See rje_seq documentation for filtering options. [None]
* spcode=X : Species code to use if it cannot be established from sequence name [None]
* features=LIST : List of files of additional features in delimited form []
* aliases=FILE : File of aliases to be added to Accession number list (for indexing) [None]
* pfam=FILE : PFam HMM download [None]
* unipath=PATH : Path to real UniProt Datafile (will look here for DB Index file made with rje_dbase)
* unireal=LIST : Real UniProt data to add to UniFake output ['AC','GN','RC','RX','CC','DR','PE','KW']
* fudgeft=T/F : Fudge the real features left/right until a sequence match is found [True]

### ~ PROCESSING/OUTPUT OPTIONS ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
* unifake=LIST : List of predictions to add to entries [tmhmm,signalp,disorder,pfam,uniprot]
* datout=FILE : Name of output DAT file [Default input FILE.dat]
* disdom=X : Disorder threshold below which to annotate PFam domain as "DOMAIN" [0.0]
* makeindex=T/F : Whether to make a uniprot index file following run [False]
* ensdat=T/F : Look for acc/pep/gene in sequence name [False]
* tmhmm=FILE : Path to TMHMM program [None]
* cleanup=T/F : Remove TMHMM files after run [True]
* signalp=FILE : Path to SignalP program [None]

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje
Other modules needed: None

unifake Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Version with full functionality.
    # 1.1 - Made changes to temp file names for parallel running
    # 1.2 - Added option for reading in Features and limited other info from real UniProt entry

UniFake Class

    UniFake DAT Generator Class. Author: Rich Edwards (2008).

    Info:str
    - Aliases = File of aliases to be added to Accession number list (for indexing) [None]
    - DatOut = Name of output DAT file [Default input FILE.dat]
    - PFam = PFam HMM download [None]
    - SPCode = Species code to use if it cannot be established from sequence name [None]
    - TMHMM = Path to TMHMM program [None]
    - SignalP = Path to SignalP program [None]
    
    Opt:boolean
    - CleanUp = Remove TMHMM directories
    - EnsDat = Look for acc/pep/gene in sequence name [False]
    - FudgeFT = Fudge the real features left/right until a sequence match is found [True]
    - MakeIndex = Whether to make a uniprot index file following run [False]
    
    Stat:numeric
    - DisDom = Disorder threshold below which to annotate PFam domain as "DOMAIN" [0.0]

    List:list
    - Features = Files of additional features in delimited form [None]
    - UniFake = List of predictions to add to DAT file [tmhmm,signalp,disorder,pfam]
    - UniReal = Real UniProt data to add to UniFake output ['AC','GN','RC','RX','CC','DR','PE','KW']

    Dict:dictionary    
    - Aliases = Aliases to be added to Accession number list {ID:[Aliases]}
    - Features = Additional features for given sequence ID {ID:[{Feature,Start,End,Description}]}

    Obj:RJE_Objects
    - SeqList = Main sequence list object
UniFake._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
UniFake._setAttributes(self)
    Sets Attributes of Object.
UniFake.addAlias(self,id,alias)
    Add alias to self.dict['Aliases'][id].
UniFake.addRealUniProt(self,seq,udata,ftlist)
    Updates feature list ft using real UniProt where possible and makes NR.
    >> seq:Sequence object = target of UniFake
    >> udata:UniProt Entry Data dictionary *Modified in place*
    >> ftlist:list of feature dictionaries to add to (and make NR) *Modified in place*
UniFake.loadAlias(self,sourcefile)
    Loads Alias data.
    >> sourcefile:str = Source filename
UniFake.loadFeatures(self,ftfile)
    Loads features from given file.
UniFake.run(self)
    Performs main run method, including both setup and UniFake.
UniFake.setup(self,clear=True)
    Sets up the Aliases and Features dictionaries.
UniFake.uniFake(self,seqs=[],store=False)
    Main UniFake method. Runs on sequences in self.obj['SeqList'] if no seqs given.
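
The Features dictionary documented above has the shape {ID:[{Feature,Start,End,Description}]}. The sketch below shows one way a delimited features table could be read into that shape. It is not the UniFake.loadFeatures() implementation: the function name, the tab delimiter and the column headers (ID, Feature, Start, End, Description) are assumptions for illustration, and the real input format is described in the UniFake Manual.

```python
import csv
import io

def load_features(handle, delimiter='\t'):
    """Read a delimited table into {ID: [{Feature, Start, End, Description}, ...]}."""
    features = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        entry = {
            'Feature': row['Feature'],
            'Start': int(row['Start']),
            'End': int(row['End']),
            'Description': row['Description'],
        }
        features.setdefault(row['ID'], []).append(entry)
    return features

# In-memory stand-in for a features file (invented example data).
example = io.StringIO(
    'ID\tFeature\tStart\tEnd\tDescription\n'
    'PROT1\tDOMAIN\t10\t120\tExample domain\n'
    'PROT1\tTRANSMEM\t130\t150\tPredicted helix\n'
    'PROT2\tSIGNALP\t1\t22\tSignal peptide\n'
)
features = load_features(example)
```

Grouping by sequence ID keeps all features for one protein together, which matches how the dictionary is keyed in the class description.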

unifake Module Methods

unifake.runMain()




unifake Module ToDo Wishlist

    # [ ] : Expand automatic species identification.

extras/:

rje_pattern_discovery [version 1.3] Pattern Discovery Module ~ [Top]

Module: rje_pattern_discovery
Description: Pattern Discovery Module
Version: 1.3
Last Edit: 07/10/06
Imports: rje, rje_seq
Imported By: None
Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

Function:
Calls and tidies TEIRESIAS. Will add SLIMDISC in time. Will also read motifs and output information content.

SLiMDisc Searching (Tested with slimdisc_V1.3.py):
seqin=FILE : Sequence File to search [None]
slimfiles=LIST : List of files for SLiMDisc discovery. May have wildcards. (Over-ruled by seqin=FILE.) [*.fas]
minsup=X : Min. number of sequences to have in file [3]
maxsup=X : Max. number of sequences to have in file [0]
slimdisc=T/F : Whether to run list of files through SLiMDisc [False]
slimopt="X" : Text string of additional SLiMDisc options [""]
useres=T/F : Whether to use existing results files or overwrite (-BT -TT) [True]
remhub=X : If X > 0.0, removes "hub" protein ("HUB_PPI") and any proteins >=X% identity to hub [0.0]
: Renames datasets as -RemHub, -KeptHub or -NoHub
slimsupport=X : Min. SLiMDisc support (-S X). If < 1, it is a proportion of input dataset size. [0.1]
slimranks=X : Return top X SLiMDisc ranked sequences [1000]
slimwall=X : TEIRESIAS walltime (minutes) in SLiMDisc run (-W X) [60]
slimquery=T/F : Whether to pull out Query Protein from name of file qry_QUERY [False]
memsaver=T/F : Whether to run SLiMDisc in memory_saver mode (-X T) [True]
bigfirst=T/F : Whether to run the biggest datasets first (e.g. for ICHEC taskfarm) [True]
slimversion=X : Version of SLiMDisc to run (See slimcall=X for batch file jobs) [1.4]
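
The slimsupport=X option above is interpreted either as an absolute count or, if below 1, as a proportion of the input dataset size. A minimal sketch of that conversion follows. The function name is hypothetical and rounding the proportion up is an assumption, since the documentation does not state how fractional sequence counts are handled.

```python
import math

def min_support(slimsupport, dataset_size):
    """Convert a slimsupport setting into an absolute number of sequences."""
    if slimsupport < 1:
        # Proportion of the dataset; rounding up is an assumption of this sketch.
        return int(math.ceil(slimsupport * dataset_size))
    return int(slimsupport)

proportional = min_support(0.1, 25)   # default-style proportional setting
absolute = min_support(4, 25)         # explicit sequence count
```

Expressing support as a proportion lets one setting be reused across batch datasets of very different sizes.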

Batch File Output options:
batchout=FILE : Create a batch file containing individual seqin=FILE calls. [None]
slimcall="X" : Call for SLiMDisc in batch mode. May have leading commands. ["python slimdisc_V1.4.py"]

TEIRESIAS Searching: *Currently windows-tested only* (Pretty obsolete with functional SLiMDisc)
mysql=T/F : MySQL mode [True]
delimit=X : Text delimiter [\\t]
info=FILE : Calculate information content of motifs in FILE (based on AA Freq from seqin, if given) [None]
igap=X : Information Content "Gap penalty" (wildcard penalisation) [0]
outfile=FILE : Output file name [seqin.teiresias.*]
teiresias=T/F : Whether to perform TEIRESIAS search on seqin=FILE [True]
teiresiaspath=PATH : Path to TEIRESIAS ['c:/bioware/Teiresias/teiresias_char.exe'] * Use forward slashes (/)
teiresiasopt=X : Options for TEIRESIAS Search (Remember "X Y Z" for spaced cmds) e.g. "-bc:\bioware\Teiresias\equiv.txt"
["-l3 -w10 -c1 -k2 -p"]

Uses general modules: copy, glob, math, os, re, string, sys, time
Uses RJE modules: rje, rje_seq, presto
Other modules needed: rje_blast, rje_dismatrix, rje_pam, rje_sequence, rje_uniprot, rje_motif

Pattern Class

	Pattern Class. Author: Rich Edwards (2005).

	Info:str
	- Pattern = Pattern RegExp
	
	Opt:boolean

	Stat:numeric
	- OccCount = Number of occurrences in all sequences
	- SeqCount = Number of sequences it occurs in
	- Info = Information Content
	- Length = Length of pattern
	- Fixed = Number of fixed positions in pattern
	- Wildcards = Number of wildcard positions in pattern

	Obj:RJE_Objects
Pattern._makeLength(self)
    		Makes length from pattern.
Pattern._setAttributes(self)
    		Sets Attributes of Object:
		- Info:str ['Pattern']
		- Opt:boolean []
		- Stats:float ['OccCount','SeqCount','Info','Length','Fixed','Wildcards']
		- Obj:RJE_Object []

PatternDiscovery Class

	PatternDiscovery Class. Author: Rich Edwards (2005).

	Info:str
	- Name = Output file name
	- SeqIn = Name of single input file
	- TeiresiasPath = Path to TEIRESIAS ['c:/bioware/Teiresias/teiresias_char.exe'] * Use forward slashes (/)
	- TeiresiasOpt = Options for TEIRESIAS Search (Remember "X Y Z" for spaced cmds) ["-l3 -w10 -c1 -k2 -p"]
	- Info = List of motifs for information content calculations
	- SlimOpt = Text string of additional SLiMDisc options [""]
	- BatchOut = Create a batch file containing individual seqin=FILE calls. [None]
	- SlimCall = Call for SLiMDisc in batch mode. May have leading commands. ["python slimdisc_V1.4.py"]
	- SlimVersion = Version of SLiMDisc to run [1.4]
	
	Opt:boolean
	- Teiresias = Whether to perform TEIRESIAS search on seqin=FILE [False]
	- SLiMDisc = Whether to run list of files through SLiMDisc [False]
	- UseRes = Whether to use existing results files or overwrite (-BT -TT) [True]
	- SlimQuery = Whether to pull out Query Protein from name of file qry_QUERY [False]
	- BigFirst = Whether to run the biggest datasets first (e.g. for ICHEC taskfarm) [True]

	Stat:numeric
	- InfoGapPen = Information Content "Gap penalty" (wildcard penalisation) [0]
	- MinSup = Min. number of sequences to have in file [3]
	- MaxSup = Max. number of sequences to have in file [200]
	- RemHub = If X > 0.0, removes "hub" protein ("HUB_PPI") and any proteins >=X% identity to hub [0.0]
	- SlimRanks = Return top X SLiMDisc ranked sequences [1000]
	- SlimWall = TEIRESIAS walltime (minutes) in SLiMDisc run (-W X) [60]
	- SlimSupport = Min. SLiMDisc support (-S X). If < 1, it is a proportion of input dataset size [0.1]

    List:list
    - Pattern = List of pattern objects
    - SlimFiles = List of files for SLiMDisc discovery. May have wildcards [*.fas]

    Dict:dictionary    

	Obj:RJE_Objects
	- SeqList = Sequence List Object
PatternDiscovery._cmdList(self)
    		Sets attributes according to commandline parameters:
		- see .__doc__ or run with 'help' option
PatternDiscovery._setAttributes(self)
    		Sets Attributes of Object:
		- Info:str ['TeiresiasPath','TeiresiasOpt','Info','SlimOpt','SeqIn','BatchOut','SlimCall','SlimVersion']
		- Opt:boolean ['Teiresias','MySQL','SLiMDisc','UseRes','SlimQuery','BigFirst']
		- Stats:float ['InfoGapPen','MinSup','MaxSup','RemHub','SlimRanks','SlimWall','SlimSupport']
		- List:list ['Pattern','SlimFiles']
		- Dict:dictionary []
		- Obj:RJE_Object []
PatternDiscovery.addTeiresiasPattern(self,patstats)
    		Adds a pattern using given patstats.
		>> patstats:list of strings [tot_occ,seq_occ,pattern,occ_list]
PatternDiscovery.calculateInformationContent(self,pattern='',aafreq={},maxinfo=0.0,gap_pen=0.0)
    		Calculates Information Content of a pattern given aa frequencies. Adapted from Norman Davey.
		>> pattern:str = motif pattern
		>> aafreq:dict = aa frequencies
		>> maxinfo:float = maximum information content score
PatternDiscovery.calculateScore(self,pattern,aaOcc)
    		Calculates Information Score - Taken from AERPIMP. Author: Norman Davey.
		>> pattern:str = pattern
		>> aaOcc:dictionary of {aa:freq}
		<< information score
PatternDiscovery.run(self)
    		Main Run method.
PatternDiscovery.slimDisc(self)
    Runs SLiMDisc on batch of files.

rje_pattern_discovery Module Methods

rje_pattern_discovery.runMain()




sfmap2png [version 0.0] Converts SLiMFinder Mapping files to PNGs ~ [Top]

Module: sfmap2png
Description: Converts SLiMFinder Mapping files to PNGs
Version: 0.0
Last Edit: 01/09/08
Imports: rje, rje_disorder, rje_seq, rje_zen
Imported By: None
Copyright © 2008 Richard J. Edwards - See source code for GNU License Notice

Function:
This module takes a SLiMFinder *.mapping.fas file and uses some R visualisations to convert it into relative
conservation/disorder PNG files.

Commandline:
* seqin=FILE : *.mapping.fas file to use as input data []

See also rje.py generic commandline options and rje_disorder.py commands.

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje, rje_zen
Other modules needed: None

sfmap2png Module Version History

    # 0.0 - Initial Compilation.

SFMap2PNG Class

    SFMap2PNG Class. Author: Rich Edwards (2008).

    Info:str
    
    Opt:boolean

    Stat:numeric

    List:list

    Dict:dictionary    

    Obj:RJE_Objects
SFMap2PNG._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SFMap2PNG._setAttributes(self)
    Sets Attributes of Object.
SFMap2PNG.run(self)
    Main run method.
SFMap2PNG.setup(self)
    Main class setup method.
SFMap2PNG.slimJimMapping(self)
    Generate SpokeAln PNGs for all spokes.

sfmap2png Module Methods

sfmap2png.runMain()




sfmap2png Module ToDo Wishlist

    # [ ] : List here

slimpickings [version 3.0] SLiMDisc results compilation and extraction ~ (slim_pickings.py)[Top]

Module: slimpickings
Description: SLiMDisc results compilation and extraction
Version: 3.0
Last Edit: 18/01/07
Imports: rje, rje_disorder, rje_scoring, rje_seq, rje_sequence, rje_uniprot
Imported By: None
Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

Function:
This is a basic results compiler for multiple SLiMDisc motif discovery datasets. There are currently the following
functional elements to the module:

1. Basic compilation of results from multiple datasets into a single file. This will search through the current
directory and any subdirectories (unless subdir=F) and pull out results into a single comma-separated file
(slimdisc_results.csv or outfile=FILE). With the basic run, the following statistics are output:
['Dataset','SeqNum','TotalAA','Rank','Score','Pattern','Occ','IC','Norm','Sim']
This file can then be imported into other applications for analysis. (E.g. rje_mysql.py can be run on the
file to construct a BUILD statement for MySQL, or StatTransfer can convert the file for STATA analysis etc.)
!!! NB: If multiple datasets (e.g. in subdirectories) have the same name, slim_pickings will become confused and
may generate erroneous data later. Please ensure that all datasets are uniquely named. !!!
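
As a rough illustration of working with the compiled file, the sketch below reads a comma-separated table with the
column headers listed above and pulls out the top-ranked pattern for each dataset. The rows are made-up values used
only to show the layout, and the helper name top_ranked is hypothetical (it is not part of slim_pickings).

```python
import csv
import io

# Illustrative rows only: these values are made up to show the column layout.
example_csv = io.StringIO(
    "Dataset,SeqNum,TotalAA,Rank,Score,Pattern,Occ,IC,Norm,Sim\n"
    "dataset_a,12,5400,1,3.2,P.LP.K,5,4.0,0.8,0\n"
    "dataset_a,12,5400,2,2.9,R..S.P,4,3.0,0.9,0\n"
    "dataset_b,8,3100,1,4.1,L.CSE,3,4.0,1.0,0\n"
)


def top_ranked(handle, delimiter=","):
    """Return {dataset: pattern} for the rank 1 motif of each dataset."""
    best = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        if int(row["Rank"]) == 1:
            best[row["Dataset"]] = row["Pattern"]
    return best


best_patterns = top_ranked(example_csv)
```

Because rows are keyed on the Dataset column, two datasets sharing a name would overwrite each other here too, which
is one reason to keep dataset names unique.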

2. Additional optional stats based on the motif sequences themselves to help rank and filter interesting results.
These are:
- AbsChg = Number of charged positions [KRDE]
- NetChg = Net charge of motif [KR] - [DE]
- BalChg = Balance of charge = Net charge in first half motif - Net charge in second half
- AILMV = Whether all positions in the motif are A,I,L,M or V.
- Aromatic = Count of F+Y+W
- Phos = Potential phosphorylated residues X (none) or [S][T][Y], whichever are present
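
The sketch below shows one possible reading of these definitions for a simple pattern made of fixed residues and "."
wildcards. It is not the module's own patternStats() code: the handling of odd-length patterns (the middle position is
ignored for BalChg), the plain-letter Phos output and the lack of support for ambiguous positions are all assumptions.

```python
POSITIVE = set("KR")
NEGATIVE = set("DE")


def net_charge(segment):
    """Count of K/R minus count of D/E in a string."""
    plus = sum(1 for aa in segment if aa in POSITIVE)
    minus = sum(1 for aa in segment if aa in NEGATIVE)
    return plus - minus


def pattern_stats(pattern):
    """Return the six pattern statistics for a simple fixed/wildcard pattern."""
    fixed = [aa for aa in pattern if aa.isalpha()]  # wildcards are skipped
    half = len(pattern) // 2  # assumption: middle position ignored if odd
    first, second = pattern[:half], pattern[len(pattern) - half:]
    return {
        "AbsChg": sum(1 for aa in fixed if aa in POSITIVE | NEGATIVE),
        "NetChg": net_charge(pattern),
        "BalChg": net_charge(first) - net_charge(second),
        "AILMV": bool(fixed) and all(aa in "AILMV" for aa in fixed),
        "Aromatic": sum(1 for aa in fixed if aa in "FYW"),
        # "X" when no S, T or Y is present (output format is an assumption)
        "Phos": "".join(aa for aa in "STY" if aa in fixed) or "X",
    }


example = pattern_stats("KR.DE")
```

For "KR.DE" this gives AbsChg 4, NetChg 0 and BalChg 4, since the positive charge sits in the first half and the
negative charge in the second.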

3. Calculation of additional statistics from the input sequences, using PRESTO. These are:
- Mean IUPred/FoldIndex Protein Disorder around the motif occurrences (including extended window either side)
- Mean Surface Accessibility around the motif occurrences (including an extended window either side)
- Mean Eisenberg Hydrophobicity around the motif occurrences (including an extended window either side)
- SLiM conservation across orthologous proteins. (This calculation needs improving.)
The mean for all occurrences of a motif will be output. In addition, percentile steps can be used to assess
motifs according to selected threshold criteria (in another package). This will return the threshold at a
given percentile, e.g. SA_pc75=2.0 would mean that 75% of occurrences have a mean Surface Accessibility value
of 2.0 or greater. (For hydrophobicity, Hyd_pc50=0.3 would be 50% of occurrences have mean Hydrophobicity of
0.3 or *less*. This is because low hydrophobicity is good for a (non-structural) functional motif.)
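
A minimal sketch of the percentile idea follows, assuming the threshold is simply the value at the required position
in the sorted list of per-occurrence means. The function name and the high_is_good switch are hypothetical; the
module's own rounding and tie handling may differ.

```python
import math


def percentile_threshold(values, pc, high_is_good=True):
    """Return a threshold such that at least pc% of values are >= it.

    With high_is_good=False (e.g. hydrophobicity) the threshold is one
    that at least pc% of values are <= instead.
    """
    if not values:
        raise ValueError("no values given")
    ordered = sorted(values, reverse=high_is_good)
    needed = max(1, int(math.ceil(len(ordered) * pc / 100.0)))
    return ordered[needed - 1]


sa_pc75 = percentile_threshold([1.0, 2.0, 3.0, 4.0], 75)
hyd_pc50 = percentile_threshold([0.1, 0.3, 0.5, 0.7], 50, high_is_good=False)
```

Here sa_pc75 is 2.0 (three of the four occurrences are 2.0 or greater) and hyd_pc50 is 0.3 (half of the occurrences
are 0.3 or less).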

4. Collation and extraction of key data for specific results. These may be by any combination of protein, motif
or dataset. If a list of datasets is not given, then all datasets will be considered. (Likewise proteins and
motifs.) To be very specific, all three lists may be specified (slimlist=LIST protlist=LIST datalist=LIST).
Information is pulled out in a two-step process:
(1) The slimdisc.*.index files are consulted for the appropriate list of datasets. If missing, these will
be regenerated. (slimdisc.motif.index and slimdisc.protein.index both point to dataset names.
slimdisc.dataset.index points these names to the full path of the results.) Only datasets returned by
all appropriate lists will be analysed for data extraction.
(2) The appropriate data on the motifs will be extracted into a directory as determined by outdir=PATH.
Depending on the options selected, the following (by default all) data is returned:
- *.motifaln.fas = customised fasta file with motifs aligned in different sequences, ready for dotplots
and manual inspection for homology not detected by BLAST.
- *.dat = UniProt DAT file for as many parents as possible.
These files will be saved in the directory set by outdir=PATH.
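
The dataset selection in step (1) can be pictured as a set intersection. The sketch below assumes each index has
already been loaded into a dictionary (motif or protein to dataset names, dataset name to path); the function name,
dictionary layout and example paths are hypothetical, and the real index file format is not shown here.

```python
def select_datasets(slim_index, prot_index, data_index,
                    slimlist=(), protlist=(), datalist=()):
    """Return {dataset: path} for datasets returned by every non-empty list."""
    selected = None  # None means "no filter applied yet"
    for keys, index in ((slimlist, slim_index), (protlist, prot_index)):
        if not keys:
            continue  # an empty list is ignored as a filter
        hits = set()
        for key in keys:
            hits.update(index.get(key, []))
        selected = hits if selected is None else selected & hits
    if datalist:
        named = set(datalist)
        selected = named if selected is None else selected & named
    if selected is None:
        selected = set(data_index)  # no lists given: consider all datasets
    return dict((name, data_index[name])
                for name in sorted(selected) if name in data_index)


# Hypothetical index contents for illustration
slim_index = {"P.LP.K": ["ds1", "ds2"]}
prot_index = {"P12345": ["ds2", "ds3"]}
data_index = {"ds1": "/results/ds1", "ds2": "/results/ds2", "ds3": "/results/ds3"}

both = select_datasets(slim_index, prot_index, data_index,
                       slimlist=["P.LP.K"], protlist=["P12345"])
```

Only ds2 is returned by both the motif and the protein list, so it is the only dataset kept.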

5. Re-ranking of results. rerank=X will now re-rank the results for each dataset according to the statistic set by
rankstat=X, and output the top X results only. By default, this is the "R-score" = ic * norm * occ / exp. The output
"Rank" will be replaced with the new rank and a new column "OldRank" added to the output. zscore=T/F turns on and off
a simple Z-score calculation based on the slimranks read in. Version 2.5 added a new option for a crude length
correction of the RScore, dividing by 20 to the power of the motif IC (as calculated by SLiM Pickings on a scale of
1.0 per fixed position). This is controlled by the lencorrect=T/F option. By default this is False (for backwards
compatibility) but with future versions this may become the default as it is assumed (by me) that it will improve
performance. However, there is currently no justification for this, so use with caution!
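
The arithmetic described above can be sketched as follows. The R-score and length correction follow the formula in
the text; the z-score shown uses a population standard deviation, which is an assumption, and the function names and
row dictionaries are hypothetical rather than the module's own reRank() and zScores() code.

```python
def r_score(ic, norm, occ, exp, len_correct=False):
    """R-score = ic * norm * occ / exp, optionally divided by 20**ic."""
    score = ic * norm * occ / exp
    if len_correct:
        score /= 20.0 ** ic  # crude length correction (lencorrect=T)
    return score


def z_scores(scores):
    """Simple z-scores (population standard deviation is an assumption)."""
    n = len(scores)
    mean = sum(scores) / float(n)
    sd = (sum((s - mean) ** 2 for s in scores) / float(n)) ** 0.5
    if sd == 0:
        return [0.0] * n
    return [(s - mean) / sd for s in scores]


def rerank(rows, rankstat="RScore", top=0):
    """Sort rows by rankstat (descending), keep top X if top > 0, add OldRank."""
    ordered = sorted(rows, key=lambda row: row[rankstat], reverse=True)
    if top > 0:
        ordered = ordered[:top]
    reranked = []
    for new_rank, row in enumerate(ordered, start=1):
        new_row = dict(row)
        new_row["OldRank"] = row["Rank"]
        new_row["Rank"] = new_rank
        reranked.append(new_row)
    return reranked


rows = [{"Pattern": "A..B", "Rank": 1, "RScore": r_score(2.0, 1.0, 2, 0.5)},
        {"Pattern": "C.DE", "Rank": 2, "RScore": r_score(3.0, 1.0, 4, 0.5)}]
new_rows = rerank(rows, top=1)
```

In this example the second motif has the higher R-score (24.0 against 8.0), so it becomes the new rank 1 with an
OldRank of 2.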

6. Filtering of results using the statfilter=LIST option, allowing results to be filtered according to a set of
rules: LIST should be (a file containing) a comma-separated list of stats to filter on, consisting of X*Y where X is
an output stat (the column header); * is an operator in the list >, >=, =, <=, <; and Y is a value that X must have,
assessed using *. This filtering is crude and may behave strangely if X is not a numerical stat (although Python does
seem to assess these alphabetically, so it may be OK)! This filtering is performed before the reranking of the motifs
if rerank=X is used. This can make run times quite long as many more motifs need stats calculations. (If rerank=X is
used without statfilter, re-ranking is done earlier to save time.) See the manual for details.
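
A minimal sketch of how an X*Y rule might be parsed and tested against one row of output is given below. The helper
names are hypothetical and the fallback to string comparison for non-numeric stats is an assumption. Whether a match
leads to a motif being kept or removed is left to the caller and should be checked against the manual.

```python
import operator
import re

OPERATORS = {">=": operator.ge, "<=": operator.le, "=<": operator.le,
             "!=": operator.ne, ">": operator.gt, "<": operator.lt,
             "=": operator.eq}

# Two-character operators are listed first so ">=" is not read as ">"
RULE_RE = re.compile(r"^(\w+)\s*(>=|<=|=<|!=|>|<|=)\s*(.+)$")


def parse_rule(rule):
    """Split 'X*Y' into (stat, operator, value) strings."""
    match = RULE_RE.match(rule.strip())
    if not match:
        raise ValueError("cannot parse filter rule: %r" % rule)
    return match.groups()


def rule_matches(row, rule):
    """Return True if the stat in row satisfies the rule."""
    stat, op, value = parse_rule(rule)
    observed = row[stat]
    try:
        observed, value = float(observed), float(value)
    except ValueError:
        observed = str(observed)  # non-numeric: compare as text
    return OPERATORS[op](observed, value)


matched = rule_matches({"UDif": 1.5}, "UDif>1")
```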

7. !!!NEW!!! with version 3.0, customised scores can be created using the newscore=LIST option, where LIST is in the
form X:Y,X:Y, where in turn X is the name for the new score (a column with this name will be produced) and Y is the
formula of the score. This formula may contain any output column names, numbers and the operators +-*/^ (^ is "to
the power of"), using brackets to set the order of calculation. Without brackets, a strict left to right hierarchy is
observed. e.g. * newscore=Eg:3+2*6 will generate a column called "Eg" containing the value 30.0. Custom scores can
feature previously defined custom scores in the command options, so a second newscore call could be
* newscore=Eg:3+2*6,Eg2:Eg^2 (= Eg squared = 900.0). This can be used in conjunction with statfilter,
e.g. * newscore=UDif:UHS-UP statfilter=UDif>1.
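
The strict left-to-right rule can be illustrated with a small evaluator. This is a sketch under stated assumptions,
not the module's customScore() code: the function names are hypothetical, column names are assumed to be simple
alphanumeric words, and unary minus is not handled.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.truediv, "^": operator.pow}
TOKEN_RE = re.compile(r"\s*(\d+\.?\d*|\w+|[-+*/^()])")


def tokenize(formula):
    """Split a formula into numbers, names, operators and brackets."""
    formula = formula.strip()
    tokens, pos = [], 0
    while pos < len(formula):
        match = TOKEN_RE.match(formula, pos)
        if not match:
            raise ValueError("bad formula: %r" % formula)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(tokens, stats):
    """Evaluate strictly left to right; only brackets change the order."""
    def operand():
        token = tokens.pop(0)
        if token == "(":
            value = expression()
            if not tokens or tokens.pop(0) != ")":
                raise ValueError("unbalanced brackets")
            return value
        if token in stats:
            return float(stats[token])  # an output column or earlier score
        return float(token)

    def expression():
        value = operand()
        while tokens and tokens[0] != ")":
            op = tokens.pop(0)
            value = OPS[op](value, operand())
        return value

    result = expression()
    if tokens:
        raise ValueError("unbalanced brackets")
    return result


def new_scores(newscore, stats):
    """Apply 'X:Y,X:Y' in order, so later scores can use earlier ones."""
    stats = dict(stats)
    for item in newscore.split(","):
        name, formula = item.split(":", 1)
        stats[name] = evaluate(tokenize(formula), stats)
    return stats


scores = new_scores("Eg:3+2*6,Eg2:Eg^2", {})
```

With no brackets, 3+2*6 is evaluated as (3+2)*6, giving Eg = 30.0 and Eg2 = 900.0 as in the text; writing 3+(2*6)
would give 15.0 instead.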

Commandline:
## Basic compilation options ##
* outfile=FILE : Name of output file. [slimdisc_results.csv]
* dirlist=LIST : List of directories from which to extract files (wildcards OK) [./]
* compile=T/F : Compile motifs from SLiMDisc rank files into output file. (False=index only) [True]
* append=T/F : Append file rather than over-writing [False]
* slimranks=X : Maximum number of SLiMDisc ranks to extract from any given dataset [5000]
* rerank=X : Re-ranks according to RScore (if expect=T) and only outputs top X new ranks (if > 0) [5000]
* rankstat=X : Stat to use to re-rank data [RScore]
* motific=T/F : Recalculate IC using PRESTO. Used for re-ranking. OldIC also output. [False]
* lencorrect=T/F : Implements crude length correction in RScore [False]
* delimit=X : Change delimiter to X [,]

## Advanced compilation options ##
* subdir=T/F : Whether to search subdirectories for rank files [True]
* webid=LIST : List of SLiMDisc webserver IDs to compile. (Works only on bioware!) []
* slimversion=X : SLiMDisc results version for compiled output [1.4]

## Additional statistics ##
* abschg=T/F : Whether to output number of charged positions (KRDE) [True]
* netchg=T/F : Whether to output net charge of motif (KR) - (DE) [True]
* balchg=T/F : Whether to output the *balance* of charge (netNT - netCT) [True]
* ailmv=T/F : Whether to output if all positions in the motif are A,I,L,M or V. [True]
* aromatic=T/F : Whether to output count of F+Y+W [True]
* phos=T/F : Whether to output potential phosphorylated residues X (none) or [S][T][Y], if present [True]
* expect=T/F : Calculate min. expected occurrence of motif in search dataset [True]
* zscore=T/F : Calculate z-scores for each motif using the entire dataset (<=slimranks) [True]
* newscore=LIST : Lists of X:Y, create a new statistic X, where Y is the formula of the score. []
* custom=LIST : Calculate Custom score as a product of stats in LIST []

## Additional calculations to make: see PRESTO for additional relevant commandline options ##
* slimsa=T/F : Calculate SA information for SLiMDisc Results [True]
* winsa=X : Number of aa to extend Surface Accessibility calculation either side of motif [0]
* slimhyd=T/F : Calculate Eisenberg Hydrophobicity for SLiMDisc Results [True]
* winhyd=X : Number of aa to extend Eisenberg Hydrophobicity calculation either side of motif [0]
* slimcons=T/F : Calculate Conservation stats for SLiMDisc results [False]
- See PRESTO conservation options. (NB. consamb does nothing.)
* slimchg=T/F : Calculate selected charge statistics (above) for occurrences in addition to pattern [False]
* slimfold=T/F : Calculate disorder using FoldIndex over the internet [False]
* slimiup=T/F : Calculate disorder using local IUPred [True]
* windis=X : Number of aa to extend disorder prediction each side of occurrence [0]
* iucut=X : Cut-off for IUPred results [0.2]
* iumethod=X : IUPred method to use (long/short) [short]
* iupath=PATH : The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
* percentile=X : Percentile steps to return in addition to mean [0]

## Collation and Extraction of specific results ##
* index=T/F : Whether to create index files (slimpicks.*.index) for proteins, motifs and datasets [True]
* bigindex=T/F : Whether to use the special makeBigIndexFiles() method [False]
* fullpath=T/F : Whether to use full path (else relative) for dataset index [True]
* slimpath=PATH : Path to place (or find) index files. *Cannot be used for extraction if fullpath=F* [./]
* slimlist=LIST : List (A,B,C) or FILE containing list of SLiMs (motifs) to extract []
* protlist=LIST : List (A,B,C) or FILE containing list of proteins for which to extract results []
* datalist=LIST : List (A,B,C) or FILE containing list of datasets for which to extract results []
* strict=T/F : Only extract protein/occurrence details for those proteins in protlist [False]
(False = extract details for all proteins in datalist datasets containing slimlist motifs)
* outdir=PATH : Directory into which extracted data will be placed. [./]
* picksid=X : Outputs an extra 'PicksID' column containing the identifier X []
* inputext=LIST : List of file extensions for original input files. (Should be in same dir as *.rank, or one dir above)
[dat,fas,fasta,faa]
* indexre=LIST : List of alternative regular expression patterns to try for index retrieval []
- ipi : 'ipi_HUMAN__(\S+)-*\d*=(\S.+)', # IPI Human sequence
- ipi_sv : '^ipi_HUMAN__([A-Za-z0-9]+)-*\d*=(\S.+)', # IPI Human UniProt splice variant
- ft : '^(\S+)_HUMAN=(\S+)', # SLiMDisc FullText (UniProt format) retrieval
- ft_sv : '^([A-Za-z0-9]+)-*\d*_HUMAN=(\S+)' # SLiMDisc FullText (UniProt format) splice variant
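
To show how the ft and ft_sv patterns above differ, the sketch below applies both to a made-up index key. The line
format "ACCESSION_HUMAN=dataset" and the example values are assumptions for illustration only; the real index lines
may look different.

```python
import re

# Patterns copied from the indexre list above
INDEX_RE = {
    "ft": r"^(\S+)_HUMAN=(\S+)",
    "ft_sv": r"^([A-Za-z0-9]+)-*\d*_HUMAN=(\S+)",
}


def try_patterns(line, names=("ft", "ft_sv")):
    """Return {pattern name: (identifier, target)} for each pattern that matches."""
    found = {}
    for name in names:
        match = re.search(INDEX_RE[name], line)
        if match:
            found[name] = match.groups()
    return found


# Hypothetical index line with a splice variant suffix
hits = try_patterns("P04637-2_HUMAN=dataset1")
```

The ft pattern keeps the splice variant suffix in the identifier ("P04637-2"), whereas ft_sv strips it to give the
parent accession ("P04637").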

## Additional Output for Extracted Motifs ##
* occres=FILE : Output individual occurrence data in FILE [None]
* extract=T/F : Extract additional data for motifs [True if datasets/SLiMs/accnums given, else False]
* motifaln=T/F : Produce fasta files of local motif alignments [True]
* flanksize=X : Size of sequence flanks for motifs [30]
* xdivide=X : Size of dividing Xs between motifs [10]
* datout=FILE : Extract UniProt entries from parent proteins where possible into FILE [uniprot_extract.dat]
* unitab=T/F : Make tables of UniProt data using rje_uniprot.py [True]
* ftout=FILE : Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLIMs [None]
* unipaths=LIST : List of additional paths containing uniprot.index files from which to look for and extract features ['']
* peptides=T/F : Peptide design around discovered motifs [False]

## Additional Output for Proteins ##
* proteinaln=T/F : Search for alignments of proteins containing motifs and produce new file containing motifs [True]
* gopher=T/F : Use GOPHER to generate missing orthologue alignments in alndir - see gopher.py options [False]
* alndir=PATH : Path to alignments of proteins containing motifs [./] * Use forward slashes (/) [Gopher/ALN/]
* alnext=X : File extension of alignment files, accnum.X [orthaln.fas]

## Advanced Filtering Options ##
* statfilter=LIST : List of stats to filter (remove matching motifs) on, consisting of X*Y where:
- X is an output stat (the column header),
- * is an operator in the list >, >=, !=, =, <=, < !!! Remember to enclose in "quotes" for <> !!!
- Y is a value that X must have, assessed using *.
This filtering is crude and may behave strangely if X is not a numerical stat!
* zfilter=T/F : Calculate the Z-score on the filtered dataset (True) or the whole dataset (False) [False]
* rankfilter=T/F : Re-ranks the filtered dataset (True) rather than the whole (pre-filtered) dataset (False) [True]
- NB. If zfilter=T then rankfilter=T.

## Old/obsolete options ##
* advprob=T/F : Calculate advanced probability based on actual sequences containing motifs [False] #!# Not right yet!! #!#
* advmax=X : Max number of sequences to use computationally intensive advanced probability [35]

*** See RJE_UNIPROT options for UniProt settings ***

Uses general modules: copy, math, os, re, string, sys, time
Uses RJE modules: rje, gopher, presto, rje_disorder, rje_motif, rje_scoring, rje_seq, rje_sequence, rje_uniprot
Other modules needed: rje_blast, rje_dismatrix, rje_pam

SlimPicker Class

    SlimPicker Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Name of output file [slimdisc_results.csv]
    - OutDir = Directory into which extracted data will be placed [./]
    - DatOut = Extract UniProt entries from parent proteins where possible into FILE [uniprot_extract.dat]
    - OccRes = Output individual occurrence data in FILE [None]
    - SlimPath = Path to place (or find) index files. *Cannot be used for extraction if fullpath=F* [./]
    - FTOut = Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLIMs [None]
    - UniPaths = List of paths containing uniprot.index files from which to look for and extract features [/home/richard/Databases/UniProt/]
    - AlnDir = Path to alignments of proteins containing motifs [./] * Use forward slashes (/) [Gopher/ALN/]
    - AlnExt = File extension of alignment files, accnum.X [aln.fas]
    - SlimVersion = SLiMDisc results version for compiled output [1.4]
    - Extract = Stores command for extraction of additional data for motifs, for setting default value in self.pickSetup()
    - RankStat = Stat to use to re-rank data [RScore]
    - PicksID = Outputs an extra 'PicksID' column containing the identifier X []
    
    Opt:boolean
    * Basic Stats Output *
    - Compile = Compile motifs from SLiMDisc rank files into output file. (False=index only) [True]
    - Extract = Extract additional data for motifs [True if datasets/SLiMs/accnums given, else False]
    - AbsChg = Whether to output number of charged positions (KRDE) [True]
    - NetChg = Whether to output net charge of motif (KR) - (DE) [True]
    - BalChg = Whether to output the *balance* of charge (netNT - netCT) [True]
    - AILMV = Whether to output if all positions in the motif are A,I,L,M or V. [True]
    - Aromatic = Whether to output count of F+Y+W [True]
    - Phos = Whether to output potential phosphorylated residues X (none) or [S][T][Y], whichever are present [True]
    - Expect = Calculate min. expected occurrence of motif in search dataset [True]
    - AdvProb = Calculate advanced probability based on actual sequences containing motifs [True]
    - SubDir = Whether to search subdirectories for rank files [True]
    - MotifIC = Recalculate IC using PRESTO. Used for re-ranking. OldIC also output. [False]
    * PRESTO Stats Output *
    - SlimSA = Calculate SA information for SLiMDisc Results [True]
    - SlimHyd = Calculate Eisenberg Hydrophobicity for SLiMDisc Results [True]
    - SlimCons = Calculate Conservation stats for SLiMDisc results [False]
    - SlimChg = Calculate selected charge statistics (above) for occurrences in addition to pattern [False]
    - SlimFold = Calculate disorder using FoldIndex over the internet [False]
    - SlimIUP = Calculate disorder using local IUPred [True]
    * Specific Results extraction *
    - Index = Whether to create index files (slimpicks.*.index) for proteins, motifs and datasets [True]
    - BigIndex = Whether to use the special makeBigIndexFiles() method
    - FullPath = Whether to use full path (else relative) for dataset index [True]
    - MotifAln = Produce fasta files of local motif alignments [True]
    - UniTab = Make tables of UniProt data using rje_uniprot.py [True]
    - Strict = Only extract protein details for those proteins in protlist [False]
              (False = extract details for all proteins in datalist datasets containing slimlist motifs)
    - ProteinAln = Search for alignments of proteins containing motifs and produce new file containing motifs [True]
    - Gopher = Use GOPHER to generate missing orthologue alignments in outdir/Gopher - see gopher.py options [False]
    * Special *
    - ZScore = Calculate z-scores for each motif using the entire dataset (<=slimranks) [True]
    - ZFilter = Calculate the Z-score on the filtered dataset (True) or the whole dataset (False) [False]
    - RankFilter = Re-ranks the filtered dataset (True) rather than the whole (pre-filtered) dataset (False) [True]
    - LenCorrect = Implements crude length correction in RScore [False]

    Stat:numeric
    - SlimRanks = Maximum number of SLiMDisc ranks to extract from any given dataset (0=all) [0]
    - Percentile = Percentile steps to return in addition to mean [25]
    - FlankSize = Size of sequence flanks for motifs [30]
    - XDivide = Size of dividing Xs between motifs [10]
    - AdvMax = Max number of sequences to use computationally intensive advanced probability [35]
    - ReRank = Re-ranks according to RScore (if expect=T) and only outputs top X new ranks (if > 0) [0]

    List:list
    - DirList = List of directories from which to extract files (wildcards OK) [./]
    * Specific Results Extraction *
    - SlimList = List (A,B,C) or FILE containing list of SLiMs (motifs) to extract []
    - ProtList = List (A,B,C) or FILE containing list of proteins for which to extract results []
    - DataList = List (A,B,C) or FILE containing list of datasets for which to extract results []
    - InputExt = List of file extensions for original input files. (Should be in same dir as *.rank, or one dir above) [dat,fas]
    - FileList = List of actual rank files to check out (paths to these files) #~# Setup in run() #~#
    - Specials = List of special (SlimX) statistics to calculate. #~# Setup in run() #~#
    - IndexRE = List of alternative regular expression patterns to try for index retrieval []
    ## Advanced Filtering/Ranking Options ##
    - StatFilter = List of stats to filter on, consisting of X*Y where:
          - X is an output stat (the column header),
          - * is an operator in the list >, >=, =, <= ,<, !=
          - Y is a value that X must have, assessed using *.
          This filtering is crude and may behave strangely if X is not a numerical stat!
    - Custom = Calculate Custom score as a product of stats in LIST []
    - WebID = List of SLiMDisc webserver IDs to compile. (Works only on bioware!) []
    - NewScore = self.dict['NewScore'] keys() in order they were read in

    Dict:dictionary
    - IndexFiles = Dictionary of index file names and types {'Slim':PATH, etc.} #~# Setup in run() #~#
    - StatFilter = Dictionary of stat filters made from self.list['StatFilter']
    - NewScore = dictionary of {X:Y} for new statistic X, where Y is the formula of the score. []

    Obj:RJE_Objects
SlimPicker._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SlimPicker._getAllRankFiles(self)
    Populates self.list['FileList'] with all rank files.
SlimPicker._getListFromIndex(self,type,searchlist)
    Returns a list of returned index values for searchlist.
    >> type:str = Type of slim index search (Data,Prot or Slim)
    >> searchlist:list of target strings
    << (returnlist,replacedict) = list of returned targets and a dictionary of {search:[found]} for alternative REs.
SlimPicker._getRankFilesFromIndexFiles(self)
    Populates self.list['FileList'] using index files and self.lists. This method is only called if one of the
    Data/Prot/Slim List lists is populated. This method works by the following:
    1. Make a list of OK datasets from ProtList and SlimList - only keep common datasets
    2. Extract full path from those datasets that also occur in DataList
    - if any list is empty, then it is ignored as a potential filter!
SlimPicker._setAttributes(self)
    Sets Attributes of Object
SlimPicker.compileOut(self,filename,headers,delimit,datadict={})
    Outputs a single line of compiled data to file.
    >> filename:str = Name of output file
    >> headers:list of field headers
    >> delimit:str = text delimiter
    >> datadict:dictionary of {header:str} = data to output. If none, will output headers themselves.
SlimPicker.customScore(self,motiflist,new)
    Calculates custom score for all motifs in dataset.
    >> motiflist:list of patterns (in rank order)
    >> new:str = Key for self.dict['NewScore']
SlimPicker.expectation(self,slim_presto,motif_pnames,fasta)
    Calculates expected support and probability. This part of the program has three functions:
        1. Calculate the expected *support* of a motif from the input dataset
        2. Use this with the observed support to calculate the probability of seeing at least that much support
        3. Create a dictionary of probabilities that each motif will occur in each protein (for occres)
    >> slim_presto:PRESTO object containing Motif objects (pattern = motif.info['Name'])
    >> motif_pnames:dictionary of {Motif:[pnames]}
    >> fasta:str = Name of TEIRESIAS input file. Link sequences via shortName()
    << prob_occ:Dictionary of {pname:{Motif:prob}}
SlimPicker.makeBigIndexFiles(self)


SlimPicker.makeIndexFiles(self)


SlimPicker.motifAlignments(self,dataset,slim_presto,motif_occ,motiflist)
    Makes alignments of the occurrences of each motif.
    >> dataset:str = Dataset name
    >> slim_presto:Presto object containing motifs
    >> motif_occ:Dictionary of {Sequence:{Motif:[positions]}} (Pos is from 0 to L-1)
    >> motiflist:List of motif strings to consider (filtered by motif and protein) *in rank order*
SlimPicker.motifOccDict(self,inputseq=None,extract_prots=[])
    Creates self.dict['Motif PNames'] and self.dict['Motif Occ'].
    >> inputseq:str = Original input sequence file [None]
    >> extract_prots = List of proteins to extract details for *in input data order*
SlimPicker.patternStats(self,pattern)
    Returns a dictionary of stats (AbsChg,NetChg,BalChg,AILMV,Aromatic,Phos).
    >> pattern:str = Pattern - could be a motif or an occurrence of a motif.
SlimPicker.pickSetup(self)
    Sets up variables and output etc. for picks() SLiMDisc Results compiler.
    Generates additional class attributes (former method-specific attributes in brackets):
    - self.info['Delimit'] = text delimiter for output (delimit)
    - self.opt['PRESTO Calculations'] = whether to perform occurrence-specific PRESTO calculations (presto_calc)
    - self.list['PRESTO Stats'] = PRESTO occurrence stats headers (presto_calc)
    - self.opt['Extract'] = whether to extract occurrence data for results (alignments, UniProt etc.)
    - self.list['Headers'] = list of headers for compiled output (headers)
    - self.list['Special Stats'] = list of "special statistics" to calculate (special_stats)
    - self.obj['SlimPRESTO'] = slim_presto    # PRESTO object for storing motifs and making calculations
    - self.obj['SlimSeq'] = slim_seq      # SeqList object for handling search dataset sequences
SlimPicker.picks(self)
    SLiMDisc Results compiler.
SlimPicker.prestoStats(self,slim_presto,motif_occ,prob_occ)
    Populates and returns presto_seqhit list of objects storing data.
    >> slim_presto:Presto object containing motifs
    >> motif_occ:Dictionary of {Sequence:{Motif:[positions]}} (Pos is from 0 to L-1)
    >> prob_occ:Dictionary of {pname:{Motif:prob}}
SlimPicker.proteinAlignments(self,slim_presto,motif_occ)
    Generates copies of alignments including SLIMs.
    >> slim_presto:Presto object containing motifs  #!# Not used! #!#
    >> motif_occ:Dictionary of {Sequence:{Motif:[positions]}} (Pos is from 0 to L-1)
SlimPicker.reRank(self,motiflist)
    ReRanks according to the new RScore and imposes second rank cut-off.
    >> motiflist:list of patterns (in rank order)
    << motiflist:list of patterns, re-ranked and filtered.
SlimPicker.reduceMotifOcc(self,motiflist)
    Reduce self.dict['Motif Occ'] to match reduced motiflist.
    >> motiflist:list of retained motifs following filters etc.
SlimPicker.run(self)
    Main controlling run method for new Slim Pickings 1.0.
SlimPicker.slimUniProt(self,dataset,motiflist)
    Creates self.dict['SlimFT'] and self.list['UniExtract'].
    >> dataset:str = current dataset being processed.
    >> motiflist:list of patterns to output
SlimPicker.statFilter(self,motiflist,type='All')
    Filters motifs according to self.list['StatFilter']. 
    >> motiflist:list of patterns 
    >> type:str = P/O Stats or All for both (default)
    << motiflist:list of filtered patterns.
SlimPicker.zScores(self,motiflist)
    Calculates Z-Scores for all motifs in dataset.
    >> motiflist:list of patterns (in rank order)

slim_pickings Module Methods

slim_pickings.runMain()




libraries/:

ned_SLiMPrints ~ [Top]

This module was not recognised by rje_pydocs.

ned_SLiMPrints_Tester ~ [Top]

This module was not recognised by rje_pydocs.

ned_alignmentHelper ~ [Top]

This module was not recognised by rje_pydocs.

ned_anchorHelper ~ [Top]

This module was not recognised by rje_pydocs.

ned_basic ~ [Top]

This module was not recognised by rje_pydocs.

ned_basic Module Methods

ned_basic.alignMotif(str1,str2,returnScore=False)
    		if maxOffset[0] >= 0:
			print origStr1
			print "-"*maxOffset[0] + origStr2
		else:
			print "-"*abs(maxOffset[0]) + origStr1
			print origStr2#"""

		if returnScore:
			return [maxOffset[0],float(maxOffset[1])/(strLength - str2.count("."))]
		else:
			return maxOffset[0]
			
			
ned_basic.alignStrings(str1,str2,returnScore=False)


ned_basic.binList(list,logValue=10,log=False,normaliser=1)


ned_basic.diffLists(list1,list2)


ned_basic.fileAgeDays(filepath)


ned_basic.fileChecker(filepath,fileDesc,exit=False)


ned_basic.gauss(x)


ned_basic.getProb(offsets,scores)


ned_basic.inFrame(start,stop,rangeStart,rangeStop)


ned_basic.informationContent(column,IC="")


ned_basic.isInt(x)


ned_basic.overlap(start,stop,rangeStart,rangeStop)


ned_basic.plotDist(dist)


ned_basic.plotList(plot,adjuster=100)


ned_basic.printDict(tmpDict)


ned_basic.removeRedundency(list)


ned_basic.removeRedundencyOrdered(list)


ned_basic.std_dev(list_temp)


ned_basic.upd(u,n)


ned_basic.within(start,stop,offset)


ned_basic.writeError(e)




ned_basicReader ~ [Top]

This module was not recognised by rje_pydocs.

ned_basicReader Module Methods

ned_basicReader.readTableFile(tablePath,delimiter="\t",hasHeader=True,key="",byColumn=True,relationship="binary",stripQuotes=False)




ned_commandLine ~ [Top]

This module was not recognised by rje_pydocs.

ned_conservationScorer ~ [Top]

This module was not recognised by rje_pydocs.

ned_disorderHelper ~ [Top]

This module was not recognised by rje_pydocs.

ned_dsspHelper ~ [Top]

This module was not recognised by rje_pydocs.

ned_eigenvalues [version 1.0] Modified N. Davey Relative Local Conservation module ~ [Top]

Module: ned_eigenvalues
Description: Modified N. Davey Relative Local Conservation module
Version: 1.0
Last Edit: 03/09/09
Imports:
Imported By: rje_slimcalc
Copyright © 2009 Norman E. Davey & Richard J. Edwards - See source code for GNU License Notice

Function:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module.

Uses general modules: operator, math, random

ned_eigenvalues Module Methods

ned_eigenvalues.add(a,b)


ned_eigenvalues.approx_equalto(z)


ned_eigenvalues.convert_to_Matrix(matrix)


ned_eigenvalues.divide(mat,val)


ned_eigenvalues.dot(v1,v2)


ned_eigenvalues.equal(x)


ned_eigenvalues.getconj(z)


ned_eigenvalues.isList(l)


ned_eigenvalues.list_approx_equaltozero(l)


ned_eigenvalues.minus(a,b)


ned_eigenvalues.mmul(a,b)


ned_eigenvalues.mmul_int(a,b)


ned_eigenvalues.mmul_vect(a,b)


ned_eigenvalues.multiply(mat,val)


ned_eigenvalues.norm(table)


ned_eigenvalues.normalise(table)


ned_eigenvalues.outer(v1,v2)


ned_eigenvalues.read_scoring_matrix(sm_file)
    Read in a scoring matrix from a file, e.g., blosum80.bla, and return it as an array.

ned_fastaHelper ~ [Top]

This module was not recognised by rje_pydocs.

ned_motifHelper ~ [Top]

This module was not recognised by rje_pydocs.

ned_mutationHelper ~ [Top]

This module was not recognised by rje_pydocs.

ned_pfam ~ [Top]

This module was not recognised by rje_pydocs.

ned_proteinInfoHelper ~ [Top]

This module was not recognised by rje_pydocs.

ned_rankbydistribution [version 1.0] Modified SLiMFinder stats module ~ [Top]

Module: ned_rankbydistribution
Description: Modified SLiMFinder stats module
Version: 1.0
Last Edit: 03/09/09
Imports: rje_seq, rje_uniprot, rje, rje_blast, rje_slim
Imported By: slimfinder
Copyright © 2009 Norman E. Davey & Richard J. Edwards - See source code for GNU License Notice

Function:
This module is a stripped down template for methods only. This is for when a class has too many methods and becomes
untidy. In this case, methods can be moved into a methods module and 'self' replaced with the relevant object.

Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module.

Uses general modules: re, copy, random, math, sys, time, os, pickle, sets, string, traceback
Uses RJE modules: rje_seq, rje_uniprot, rje, rje_blast, rje_slim

ned_rankbydistribution Module Version History

    # 0.0 - Initial Compilation based on "25 september version". (Probably not!)

ned_rankbydistribution Module Methods

ned_rankbydistribution.patternFromCode(slim)


ned_rankbydistribution.reformatMotif(slim)
    Reformats motif from SLiMFinder "SLiM" format.
ned_rankbydistribution.slimDif(slim1,slim2)


ned_rankbydistribution.slimLen(slim)


ned_rankbydistribution.slimPos(slim)




ned_rankbydistribution Module ToDo Wishlist

    # [ ] : Fix the bugs with the new implementation.

ned_stats ~ [Top]

This module was not recognised by rje_pydocs.

ned_stats Module Methods

ned_stats.binomial(k,n,p)


ned_stats.choose(k,n)


ned_stats.cum_binomial(k,n,p)


ned_stats.cum_poisson(expected,observed)


ned_stats.cum_uniform_product(p,n,alternate = False)


ned_stats.erf(z)


ned_stats.factorial(m,s=0)


ned_stats.factorialGamma(m,s=0)


ned_stats.factorialRamanujan(m,s=0,logged=True)


ned_stats.fishers(a,b,c,d)


ned_stats.gamma(z)


ned_stats.incomplete_gamma(s,x,alternate=False)


ned_stats.poisson(expected,observed)


ned_stats.product(values)


ned_stats.rlcProb(rlc,u,sd)


ned_stats.std_dev(list_temp)


ned_stats.uniform_product_density(p,n)




ned_structureHelper ~ [Top]

This module was not recognised by rje_pydocs.

ned_timeUtilities ~ [Top]

This module was not recognised by rje_pydocs.

ned_timeUtilities Module Methods

ned_timeUtilities.formatTime(time)




ned_uniprotHelper ~ [Top]

This module was not recognised by rje_pydocs.

rje [version 4.0] Contains General Objects for all my (Rich's) scripts ~ [Top]

Module: rje
Description: Contains General Objects for all my (Rich's) scripts
Version: 4.0
Last Edit: 05/03/11
Imports:
Imported By: aphid, badasp, budapest, comparimotif_V3, fiesta, gablam, gasp, gfessa, gopher_V2, happi, haqesac, multihaq, picsi, pingu, presto_V5, qslimfinder, seqmapper, slimfinder, slimmaker, slimsearch, unifake, compass, file_monster, peptide_dismatrix, peptide_stats, pic_html, prodigis, rem_parser, rje_dbase, rje_itunes, rje_mysql, rje_pattern_discovery, rje_phos, rje_pydocs, rje_scansite, rje_seqgen, rje_seqplot, rje_sleeper, rje_ssds, rje_yeast, seqforker, sfmap2png, slim_pickings, wormpump, RankByDistribution, bob, ned_rankbydistribution, rje_aaprop, rje_ancseq, rje_biogrid, rje_blast, rje_codons, rje_conseq, rje_db, rje_dismatrix, rje_dismatrix_V2, rje_disorder, rje_embl, rje_ensembl, rje_genbank, rje_genecards, rje_genemap, rje_go, rje_haq, rje_hmm, rje_hprd, rje_html, rje_iridis, rje_markov, rje_mascot, rje_menu, rje_motif_V3, rje_motif_cons, rje_motif_stats, rje_motiflist, rje_motifocc, rje_omim, rje_pam, rje_paml, rje_ppi, rje_qsub, rje_scoring, rje_seq, rje_seqlist, rje_sequence, rje_slim, rje_slimcalc, rje_slimcore, rje_slimhtml, rje_slimlist, rje_specificity, rje_svg, rje_tm, rje_tree, rje_tree_group, rje_uniprot, rje_xgmml, rje_xml, rje_zen, slimfrap, slimjim
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
General module containing Classes used by all my scripts plus a number of miscellaneous methods.
- Output to Screen, Commandline parameters and Log Files

Commandline options are all in the form X=Y. Where Y is to include spaces, use X="Y".
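
As a rough illustration of the X=Y convention, the sketch below splits a list of arguments into options. The function name parse_commands is hypothetical and this is not the parser used by rje. It also assumes that bare words such as 'help' are treated as flags, and that one pair of surrounding double quotes is removed from Y.

```python
def parse_commands(args):
    """Split X=Y style arguments into a dict (hypothetical helper, not rje's parser)."""
    options = {}
    for arg in args:
        if '=' in arg:
            # Split on the first '=' only, so Y may itself contain '='
            key, value = arg.split('=', 1)
            # Assumption: strip one pair of surrounding double quotes, as in X="Y"
            if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
                value = value[1:-1]
            options[key] = value
        else:
            # Assumption: a bare word (e.g. 'help') acts as a simple flag
            options[arg] = True
    return options


opts = parse_commands(['v=1', 'log=run.log', 'basefile="my results"', 'help'])
print(opts)
```

Values are kept as strings here; converting them to numbers or T/F booleans would be a separate step.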

General Commandline:
* v=X : Sets verbosity (-1 for silent) [0]
* i=X : Sets interactivity (-1 for full auto) [0]
* log=FILE : Redirect log to FILE [Default = calling_program.log]
* newlog=T/F : Create new log file. [Default = False: append log file]
* silent=T/F : If set to True will not write to screen or log. [False]
* errorlog=FILE : If given, will write errors to an additional error file. [None]
* help : Print help to screen

Program-Specific Commands: (Some programs only)
* basefile=FILE : This will set the 'root' filename for output files (FILE.*), including the log
* outfile=FILE : This will set the 'root' filename for output files (FILE.*), excluding the log
* delimit=X : Sets standard delimiter for results output files [\t]
* mysql=T/F : MySQL output
* append=T/F : Append to results files rather than overwrite [False]
* force=T/F : Force to regenerate data rather than keep old results [False]
* backups=T/F : Whether to generate backup files (True) or just overwrite without asking (False) [True]
* maxbin=X : Maximum number of trials for using binomial (else use Poisson) [-]

System Commandline:
* win32=T/F : Run in Win32 Mode [False]
* pwin : Run in PythonWin (** Must be 'commandline', not in ini file! **)
* cerberus : Run on Cerberus cluster at RCSI
* memsaver=T/F : Some modules will have a memsaver option to save memory usage [False]
* runpath=PATH : Run program from given path (log files and some programs only) [path called from]
* rpath=PATH : Path to installation of R ['c:\\Program Files\\R\\R-2.6.2\\bin\\R.exe']
* soaplab=T/F : Implement special options/defaults for SoapLab implementations [False]

Forking Commandline:
* noforks=T/F : Whether to avoid forks [False]
* forks=X : Number of parallel sequences to process at once [0]
* killforks=X : Number of seconds of no activity before killing all remaining forks. [3600]

Classes:
RJE_Object(log=None,cmd_list=[]):
- Metaclass for inheritance by other classes.
>> log:Log = rje.Log object
>> cmd_list:List = List of commandline variables
On initiation, this object:
- sets the Log object (if any)
- sets verbosity and interactive attributes
- calls the _setAttributes() method to setup class attributes
- calls the _cmdList() method to process relevant Commandline Parameters
Log(itime=time.time(),cmd_list=[]):
- Handles log output; printing to log file and error reporting
>> itime:float = initiation time
>> cmd_list:list of commandline variables
Info(prog='Unknown',vers='X',edit='??/??/??',desc='Python script',author='Unknown',ptime=None):
- Stores intro information for a program.
>> prog:str = program name
>> vers:str = version number
>> edit:str = last edit date
>> desc:str = program description
>> author:str = author name
>> ptime:float = starting time of program, time.time()
Out(cmd=[]):
- Handles basic generic output to screen based on Verbosity and Interactivity for modules without classes.
>> cmd:list = list of command-line arguments
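
The initiation order listed for RJE_Object above can be pictured with a minimal stand-in class. SketchObject and its attribute defaults are hypothetical and only mirror the four listed steps; the real class does considerably more.

```python
class SketchObject(object):
    """Hypothetical stand-in that follows the initiation order described for RJE_Object."""

    def __init__(self, log=None, cmd_list=None):
        self.log = log                        # 1. set the Log object (may be None)
        self.cmd_list = list(cmd_list or [])
        self.verbosity = 0                    # 2. verbosity and interactivity defaults
        self.interactive = 0
        self._setAttributes()                 # 3. set up class attributes
        self._cmdList()                       # 4. process relevant commandline parameters

    def _setAttributes(self):
        # Assumed attribute dictionaries, named after the info/opt/stat types in the docs
        self.info = {'Name': 'None'}
        self.opt = {}
        self.stat = {}

    def _cmdList(self):
        # Only v=X and i=X are handled in this sketch
        for cmd in self.cmd_list:
            if cmd.startswith('v='):
                self.verbosity = int(cmd[2:])
            elif cmd.startswith('i='):
                self.interactive = int(cmd[2:])


obj = SketchObject(cmd_list=['v=1', 'i=-1'])
print(obj.verbosity, obj.interactive)
```

The sketch uses cmd_list=None rather than the documented cmd_list=[] default, which avoids sharing one list between instances.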

Uses general modules: glob, math, os, random, re, string, sys, time, traceback

rje Module Version History

    # 0.0 - Initial Compilation based on RJE_General01.plx. Simplified Class Names
    # 0.1 - Added comments and changed capitalisation etc.
    # 0.2 - Added RJE_Object class and 'No Log' concept
    # 0.3 - Updated RJE_Object to not need Out object (restrict Out Object to Log object?)
    # 1.0 - Better Documentation to go with GASP V:1.2
    # 1.1 - Added make Path and sorted ErrorLog
    #       def errorLog(self, text='Missing text for errorLog() call!',quitchoice=False,printerror=True):
    # 1.2 - Added Norman's error-tracking and 'basefile/outfile' options as Object defaults
    # 2.0 - Major Overhaul of Module. Out, Log and RJE_Object will now inherit RJE_Object_Shell (see also changes below)
    # 2.1 - Added X="Y" options
    # 2.2 - Added confirm=T/F option to choice()
    # 2.3 - Added more delimited text functions
    # 2.4 - Added listFromCommand() function
    # 2.5 - Added Force opt
    # 3.0 - Added self.list and self.dict dictionaries to RJE_Object. Added 'list', 'clist' and 'glist' to cmdRead.
    # 3.1 - Added object save and load methods
    # 3.2 - Added more refined path options
    # 3.3 - Added general delimitedFileOut() method based on slim_pickings.py compileOut()
    # 3.4 - Added cmdReadList for handling lists of "standard" options
    # 3.5 - Added getFileName for interactive input of file names with confirmation/existence options and checkInputFiles()
    # 3.6 - Added dataDict() method for extracting data from delimited file into dictionary
    # 3.7 - Added extra dictionary methods for storage and retrieval of results in a dict['Data'] dictionary
    # 3.8 - Gave module a general tidy up.
    # 3.9 - Added RunPath to control where program is run.
    # 3.10- Added errorlog=FILE option to redirect errors to an additional error file.
    # 3.11- Added dictionary ranking method.
    # 3.12- Added more list methods.
    # 4.0 - Added RJE_ObjectLite for data storage objects with minimal generic methods and attributes.

Info Class

    Stores intro information for a program.
    >> program:str = program name
    >> version:str = version number
    >> last_edit:str = last edit date
    >> description:str = program description
    >> author:str = author name
    >> start_time:float = starting time of program, time.time()
Info.__init__(self,prog='Unknown',vers='X',edit='??/??/??',desc='Python script',author='Unknown',ptime=None,copyright='2007',comments=[])
    Stores intro information for a program.
    >> prog:str = program name
    >> vers:str = version number
    >> edit:str = last edit date
    >> desc:str = program description
    >> author:str = author name
    >> ptime:float = starting time of program, time.time()
    >> copyright:str = year of copyright

Log Class

Class to handle log output: printing to log file and error reporting.
Log.__init__(self,itime=time.time(),cmd_list=[])
    Handles log output; printing to log file and error reporting.
    >> itime:float = initiation time
    >> cmd_list:list of commandline variables
Log._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Log.errorLog(self, text='Missing text for errorLog() call!',quitchoice=False,printerror=True,nextline=True,log=True,errorlog=True)
    Raises text as error and prints to log.
    >> text:str = Error Description Text to print to log.
    >> quitchoice:bool = whether to give user choice to terminate program prematurely.
    >> printerror:bool = whether to print the system error. (Only if no return given.)
    >> nextline:bool [True] = whether to print error on next line
Log.myRunTime(self, secs)
    Converts time in seconds into time for easy comprehension
    >> secs:int = time in seconds
    << str = hours:minutes:seconds
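
A conversion of this kind can be sketched with divmod. The name run_time is hypothetical, and the zero-padded output is an assumption; the exact string returned by Log.myRunTime() is not shown in this document.

```python
def run_time(secs):
    """Format a duration in seconds as hours:minutes:seconds (hypothetical helper)."""
    secs = int(secs)
    hours, remainder = divmod(secs, 3600)
    minutes, seconds = divmod(remainder, 60)
    # Assumption: each field is zero-padded to two digits
    return '%02d:%02d:%02d' % (hours, minutes, seconds)


print(run_time(3725))   # 01:02:05
```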
Log.name(self)


Log.printLog(self, id='#ERR', text='Log Text Missing!', timeout=True, screen=True, log=True, newline=True, error=False)
    Prints text to log with or without run time.
    >> id:str = identifier for type of information
    >> text:str = log text
    >> timeout:boolean = whether to print run time
    >> screen:boolean = whether to print to screen (v>=0)
    >> log:boolean = whether to print to log file
    >> newline:boolean = whether to add newline if missing [True]

Out Class

Class to handle basic generic output to screen based on Verbosity and Interactivity outside of Classes.
Out.SafeprintIntro(self,info)
    Prints introductory program information to screen.
    >> info:Info Object
Out.printIntro(self,info)
    Prints introductory program information to screen.
    >> info:Info Object

RJE_Object Class

Metaclass for inheritance by other classes.
RJE_Object._activeForks(self,pidlist=[])
    Checks Process IDs of list and returns list of those still running.
    >> pidlist:list of integers = Process IDs
RJE_Object._editChoice(self,text,value,numeric=False,boolean=False)
    Returns Current or New Value as chosen.
    >> text:str = Text to display for choice
    >> value:str/int/float/bool = Existing value
    >> numeric:bool = whether a numeric value is wanted
    >> boolean:bool = whether a True/False value is wanted
RJE_Object._forkCmd(self,cmd='')
    Sets attributes according to commandline parameters:
    - see rje.__doc__ or run with 'help' option
RJE_Object._setDefaults(self,info='None',opt=False,stat=0.0,obj=None,setlist=False,setdict=False)
    Sets default defaults.
    >> info:str = default info setting
    >> opt:bool = default opt setting
    >> stat:float = default stat setting
    >> obj:Object = default obj setting
    >> setlist:boolean = whether to set list attributes [False]
    >> setdict:boolean = whether to set dictionary attributes [False]
RJE_Object.addLoadObj(self,object_type=None,log=None,cmd_list=[])
    Returns a new object of the right type.
    >> object_type:str = Code from object used to identify correct object to create.
    >> log:Log object to feed new object
    >> cmd_list:List of commands for new object
RJE_Object.attDetails(self,types=['All'],printblanks=True)
    Returns object details as text.
RJE_Object.baseFile(self,newbase=None)


RJE_Object.basefile(self,newbase=None)


RJE_Object.checkInputFiles(self,ilist,ask=True)
    Checks for existence of Input Files and asks for them if missing. Raises error if not interactive and missing.
    >> ilist:list of keys for self.info that should be sequence files
    >> ask:boolean = whether to ask for file if missing and i >= 0
RJE_Object.details(self)
    Returns object details as text.
RJE_Object.edit(self)
    Options to change all object details (Info, Stat, Opt) and associated objects' details too.
    #!# Lists and Dictionaries not included. #!#
RJE_Object.errorLog(self, text='Missing text for errorLog() call!',quitchoice=False,printerror=True,nextline=True,log=True,errorlog=True)


RJE_Object.gUnzip(self,file,log=True)
    Unzips a given file.
RJE_Object.gZip(self,filename,log=True,unlink=True)
    Zips file, if appropriate.
RJE_Object.getAtt(self,type,key,default=None)


RJE_Object.getAttribute(self,type,key,default=None)
    Gets object information of correct type.
    >> type:str = 'info','list','dict','opt','stat' or 'obj'
    >> key:str = key for given attribute dictionary
    >> default:anything = what to return if type is wrong or key not found in type
RJE_Object.getBool(self,ikey,default=False)


RJE_Object.getData(self,dkey,dlist=['stat','info','opt'],case=False,str=True,default=None,dp=-1)
    Gets data from dict['Data'] if it exists and has key, else tries dlist dictionaries.
    >> dkey:str = Key for dictionaries
    >> dlist:list [self.stat,self.info,self.opt] = list of dictionaries to try after dict['Data']
    >> case:bool [False] = whether to match case for dkey
    >> str:bool [True] = whether to return all values as a string
    >> default [None] = what to return if no dictionary has key
    >> dp:int [-1] = Number of decimal places to use for stats if str=True (-1 = return as is)
    << returns value from data or stat/info/opt
RJE_Object.getDict(self,dkey,dkeykey,default=None)
    Returns value of self.dict[dkey][dkeykey] else default.
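
The nested lookup with a fallback might be sketched as follows. get_dict is a hypothetical free function that takes the dictionary explicitly; it is not the method itself.

```python
def get_dict(store, dkey, dkeykey, default=None):
    """Return store[dkey][dkeykey], or default if either key is missing (sketch only)."""
    try:
        return store[dkey][dkeykey]
    except (KeyError, TypeError):
        # KeyError: a key is absent; TypeError: store[dkey] is not a dictionary
        return default


store = {'Data': {'Rank': 1}}
print(get_dict(store, 'Data', 'Rank'))          # 1
print(get_dict(store, 'Data', 'Score', 'n/a'))  # n/a
```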
RJE_Object.getInfo(self,ikey,default='None',checkdata=True)
    Gets object information or returns default if not found.
    >> ikey:str = key for self.info dictionary
    >> default:str = values returned if ikey not found.
    >> checkdata:boolean = whether to check self.dict['Data'] if missing
RJE_Object.getInt(self,ikey,default=0,checkdata=True)


RJE_Object.getNum(self,ikey,default=0.0,checkdata=True)


RJE_Object.getOpt(self,okey,default=False)
    Gets object opt or returns default.
RJE_Object.getStat(self,skey,default=0,checkdata=True)
    Gets object stat or returns default if not found.
    >> skey:str = key for self.stat dictionary
    >> default:num = value returned if skey not found.
    >> checkdata:boolean = whether to check self.dict['Data'] if missing
RJE_Object.getStr(self,ikey,default='None',checkdata=True)


RJE_Object.loadData(self,type=None,key=None,data=None)
    Loads a single piece of data.
    >> type:str = Info/Opt/Stat
    >> key:Key for relevant data type
    >> data:Actual data
RJE_Object.loadFromFile(self,filename=None,v=0,checkpath=True,chomplines=False)
    Loads all data from file and returns readlines() list. Will look in same directory and self.info['Path']
    >> filename:str = Name of file
    >> v:int = verbosity setting for loading message
    >> checkpath:boolean [True] = whether to check in self.info['Path'] if filename missing.
    >> chomplines:boolean [False] = whether to remove \r and \n from lines
    << filelines:list of lines from file
RJE_Object.loadSelf(self,filename=None)
    Loads a saved dump of own basic data (info,stat,opt) from file. Superseded by pickling.
    >> filename:str = name of file
RJE_Object.loadSpecial(self,line='')
    Loads a single piece of data.
    >> line:str = Special data to load
RJE_Object.needToRemake(self,checkfile,parentfile,checkdate=None,checkforce=True)
    Checks whether checkfile needs remake.
    >> checkfile:str = File name of file that may need remaking.
    >> parentfile:str = File name of file that was used to make checkfile. (Should be older)
    >> checkdate:bool = whether to bother checking the comparative dates.
    >> checkforce:bool = whether to use self.force() to identify whether remake should be forced
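
One plausible reading of this check compares file modification times. The sketch below is an assumption about the logic, not the rje implementation: need_to_remake is a hypothetical function, and the force argument stands in for the self.force() lookup.

```python
import os
import tempfile


def need_to_remake(checkfile, parentfile, checkdate=True, force=False):
    """Return True if checkfile should be regenerated (sketch under stated assumptions)."""
    if force or not os.path.exists(checkfile):
        return True
    if not checkdate or not os.path.exists(parentfile):
        return False
    # Remake when the parent file was modified after the file made from it
    return os.path.getmtime(parentfile) > os.path.getmtime(checkfile)


workdir = tempfile.mkdtemp()
parent = os.path.join(workdir, 'input.fas')
child = os.path.join(workdir, 'output.tdt')
open(parent, 'w').close()
open(child, 'w').close()
os.utime(parent, (1000, 1000))   # parent older than child
os.utime(child, (2000, 2000))
fresh = need_to_remake(child, parent)
os.utime(parent, (3000, 3000))   # parent now newer than child
stale = need_to_remake(child, parent)
print(fresh, stale)
```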
RJE_Object.printLog(self, id='#ERR', text='Log Text Missing!', timeout=True, screen=True, log=True, newline=True)


RJE_Object.prog(self)


RJE_Object.progLog(self, id='#ERR', text='Log Text Missing!',screen=True)


RJE_Object.saveData(self,OUTFILE,skiplist=[])
    Saves a dump of basic data into OUTFILE handle.
    >> OUTFILE:open file handle for writing
    >> skiplist:list of things to skip output for.
    - 'info','opt','stat','list','dict','obj'
RJE_Object.saveSelf(self,filename=None,append=False)
    Saves a dump of own basic data (info,stat,opt) into file. Superseded by pickling.
    >> filename:str = name of file (else self name.object.txt)
    >> append:boolean = whether to append filename or write anew
RJE_Object.selfLoadTidy(self,obj_dict)
    Tidies up unusual data types, such as dictionaries.
    >> obj_dict:Dictionary of object codes and objects
RJE_Object.setAttribute(self,type,key,newvalue)
    Sets object information of correct type.
    >> type:str = 'info','list','dict','opt','stat' or 'obj'
    >> key:str = key for given attribute dictionary
    >> newvalue:str = string version of new value
RJE_Object.setBasefile(self,basefile=None,cascade=True)
    Sets basefile and cascades to daughter objects.
RJE_Object.setBool(self,attdic,addtolist=True)


RJE_Object.setDict(self,dictdic,addtolist=True)
    Sets object dictionary attributes.
    >> dictdic:dictionary with keys corresponding to self.dict keys
    >> addtolist:boolean = whether to add to self.dictlist if missing
RJE_Object.setDictData(self,datadict,dictkey='Data')
    Sets object Data dict values.
    >> datadict = Dictionary of values to add to self.dict[dictkey]
RJE_Object.setInfo(self,infodic,addtolist=True)
    Sets object information.
    >> infodic:dictionary with keys corresponding to self.info keys
    >> addtolist:boolean = whether to add to self.infolist if missing
RJE_Object.setInt(self,attdic,addtolist=True)


RJE_Object.setList(self,listdic,addtolist=True)
    Sets object list attributes.
    >> listdic:dictionary with keys corresponding to self.list keys
    >> addtolist:boolean = whether to add to self.listlist if missing
RJE_Object.setLog(self,log,cascade=True)
    Sets given log as log object and cascades through self.obj.
RJE_Object.setNum(self,attdic,addtolist=True)


RJE_Object.setObj(self,objdic,addtolist=True)
    Sets object obj attributes.
    >> objdic:dictionary with keys corresponding to self.obj keys
    >> addtolist:boolean = whether to add to self.objlist if missing
RJE_Object.setOpt(self,optdic,addtolist=True)
    Sets object opt attributes.
    >> optdic:dictionary with keys corresponding to self.opt keys
    >> addtolist:boolean = whether to add to self.optlist if missing
RJE_Object.setStat(self,statdic,addtolist=True)
    Sets object stat attributes.
    >> statdic:dictionary with keys corresponding to self.stat keys
    >> addtolist:boolean = whether to add to self.statlist if missing
RJE_Object.setStr(self,attdic,addtolist=True)




RJE_ObjectLite Class

Metaclass for inheritance by other classes.
RJE_ObjectLite._editChoice(self,text,value,numeric=False,boolean=False)
    Returns Current or New Value as chosen.
    >> text:str = Text to display for choice
    >> value:str/int/float/bool = Existing value
    >> numeric:bool = whether a numeric value is wanted
    >> boolean:bool = whether a True/False value is wanted
RJE_ObjectLite._setDefaults(self,info='None',opt=False,stat=0.0,obj=None,setlist=False,setdict=False)
    Sets default defaults.
    >> info:str = default info setting
    >> opt:bool = default opt setting
    >> stat:float = default stat setting
    >> obj:Object = default obj setting
    >> setlist:boolean = whether to set list attributes [False]
    >> setdict:boolean = whether to set dictionary attributes [False]
RJE_ObjectLite._setGeneralAttributes(self)
    Sets general attributes for use in all classes.
RJE_ObjectLite.attDetails(self,types=['All'],printblanks=True)
    Returns object details as text.
RJE_ObjectLite.details(self)


RJE_ObjectLite.edit(self)
    Options to change all object details (Info, Stat, Opt) and associated objects' details too.
    #!# Lists and Dictionaries not included. #!#
RJE_ObjectLite.errorLog(self, text='Missing text for errorLog() call!',quitchoice=False,printerror=True,nextline=True,log=True,errorlog=True)


RJE_ObjectLite.getAtt(self,type,key,default=None)


RJE_ObjectLite.getAttribute(self,type,key,default=None)
    Gets object information of correct type.
    >> type:str = 'info','list','dict','opt','stat' or 'obj'
    >> key:str = key for given attribute dictionary
    >> default:anything = what to return if type is wrong or key not found in type
RJE_ObjectLite.getData(self,dkey,dlist=['stat','info','opt'],case=False,str=True,default=None,dp=-1)
    Gets data from dict['Data'] if it exists and has key, else tries dlist dictionaries.
    >> dkey:str = Key for dictionaries
    >> dlist:list [self.stat,self.info,self.opt] = list of dictionaries to try after dict['Data']
    >> case:bool [False] = whether to match case for dkey
    >> str:bool [True] = whether to return all values as a string
    >> default [None] = what to return if no dictionary has key
    >> dp:int [-1] = Number of decimal places to use for stats if str=True (-1 = return as is)
    << returns value from data or stat/info/opt
RJE_ObjectLite.getDict(self,dkey,dkeykey,default=None)
    Returns value of self.dict[dkey][dkeykey] else default.
RJE_ObjectLite.getInfo(self,ikey,default='None',checkdata=True)
    Gets object information or returns default if not found.
    >> ikey:str = key for self.info dictionary
    >> default:str = values returned if ikey not found.
    >> checkdata:boolean = whether to check self.dict['Data'] if missing
RJE_ObjectLite.getOpt(self,okey,default=False)
    Gets object opt or returns default.
RJE_ObjectLite.getStat(self,skey,default=0,checkdata=True)
    Gets object stat or returns default if not found.
    >> skey:str = key for self.stat dictionary
    >> default:num = value returned if skey not found.
    >> checkdata:boolean = whether to check self.dict['Data'] if missing
RJE_ObjectLite.getStr(self,ikey,default='None',checkdata=True)


RJE_ObjectLite.printLog(self, id='#ERR', text='Log Text Missing!', timeout=True, screen=True, log=True, newline=True)


RJE_ObjectLite.prog(self)


RJE_ObjectLite.progLog(self, id='#ERR', text='Log Text Missing!',screen=True)


RJE_ObjectLite.setAttribute(self,type,key,newvalue)
    Sets object information of correct type.
    >> type:str = 'info','list','dict','opt','stat' or 'obj'
    >> key:str = key for given attribute dictionary
    >> newvalue:str = string version of new value
RJE_ObjectLite.setBool(self,attdic,addtolist=True)


RJE_ObjectLite.setDict(self,dictdic,addtolist=True)
    Sets object dictionary attributes.
    >> dictdic:dictionary with keys corresponding to self.dict keys
    >> addtolist:boolean = whether to add to self.dictlist if missing
RJE_ObjectLite.setDictData(self,datadict,dictkey='Data')
    Sets object Data dict values.
    >> datadict = Dictionary of values to add to self.dict[dictkey]
RJE_ObjectLite.setInfo(self,infodic,addtolist=True)
    Sets object information.
    >> infodic:dictionary with keys corresponding to self.info keys
    >> addtolist:boolean = whether to add to self.infolist if missing
RJE_ObjectLite.setInt(self,attdic,addtolist=True)


RJE_ObjectLite.setList(self,listdic,addtolist=True)
    Sets object list attributes.
    >> listdic:dictionary with keys corresponding to self.list keys
    >> addtolist:boolean = whether to add to self.listlist if missing
RJE_ObjectLite.setLog(self,log,cascade=True)
    Sets given log as log object and cascades through self.obj.
RJE_ObjectLite.setNum(self,attdic,addtolist=True)


RJE_ObjectLite.setObj(self,objdic,addtolist=True)
    Sets object obj attributes.
    >> objdic:dictionary with keys corresponding to self.obj keys
    >> addtolist:boolean = whether to add to self.objlist if missing
RJE_ObjectLite.setOpt(self,optdic,addtolist=True)
    Sets object opt attributes.
    >> optdic:dictionary with keys corresponding to self.opt keys
    >> addtolist:boolean = whether to add to self.optlist if missing
RJE_ObjectLite.setStat(self,statdic,addtolist=True)
    Sets object stat attributes.
    >> statdic:dictionary with keys corresponding to self.stat keys
    >> addtolist:boolean = whether to add to self.statlist if missing
RJE_ObjectLite.setStr(self,attdic,addtolist=True)




RJE_Object_Shell Class

Metaclass for inheritance by other classes. Forms shell for RJE_Object, Out and Log objects.
RJE_Object_Shell.__init__(self,log=None,cmd_list=[],parent=None)
    RJE_Object:
    >> log:Log = rje.Log object
    >> cmd_list:List = List of commandline variables

    On initiation, this object:
    - sets the Log object (can be None)
    - sets verbosity and interactive attributes
    - calls the _setAttributes() method to setup class attributes
    - calls the _cmdList() method to process relevant Commandline Parameters       
RJE_Object_Shell._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
RJE_Object_Shell._cmdRead(self,cmd=None,type='info',att=None,arg=None)
    Sets self.type[att] from commandline command cmd.
    >> type:str = type of attribute (info,path,opt,int,float,min,max,list,clist,glist,file)
    >> att:str = attribute (key of dictionary)
    >> arg:str = commandline argument[att.lower()]
    >> cmd:str = commandline command
RJE_Object_Shell._cmdReadList(self,cmd=None,type='info',attlist=[])
    Sets self.type[att] from commandline command cmd.
    >> cmd:str = commandline command
    >> type:str = type of attribute (info,path,opt,int,float,min,max,list,clist,glist,file)
    >> attlist:list of attributes (key of dictionary) where commandline argument is att.lower()  []
RJE_Object_Shell._generalCmd(self,cmd='')
    Sets general attributes according to commandline parameters:
    - see rje.__doc__ or run with 'help' option
RJE_Object_Shell._setAttributes(self)
    Sets Attributes of Object:
    - Info, Stat, Opt and Obj
RJE_Object_Shell._setForkAttributes(self)
    Sets general forking attributes for use in all classes.
RJE_Object_Shell._setGeneralAttributes(self)
    Sets general attributes for use in all classes.
RJE_Object_Shell.bugPrint(self,text)


RJE_Object_Shell.data(self,table,strict=False)


RJE_Object_Shell.db(self,table=None)


RJE_Object_Shell.deBug(self,text,pause=True)
    Prints text to screen if self.opt['DeBug'].
    >> text:str = Debugging text to print.
RJE_Object_Shell.debug(self,text)


RJE_Object_Shell.force(self)


RJE_Object_Shell.i(self)


RJE_Object_Shell.interactive(self)


RJE_Object_Shell.pickleMe(self,basefile=None,gzip=True,replace=True)
    Saves self object to pickle and zips.
    >> basefile:str [None] = if none, will use self.info['Basefile']
    >> gzip:bool [True] = whether to GZIP (win32=F only)
    >> replace:bool [True] = whether to replace existing Pickle
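
Saving an object to a gzipped pickle and loading it back can be sketched with the standard library. The function names and the '.pickle' / '.pickle.gz' file naming are assumptions made for illustration, not taken from rje.

```python
import gzip
import os
import pickle
import tempfile


def pickle_me(obj, basefile, use_gzip=True):
    """Save obj to basefile.pickle(.gz) and return the path written (sketch only)."""
    path = basefile + '.pickle' + ('.gz' if use_gzip else '')
    opener = gzip.open if use_gzip else open
    with opener(path, 'wb') as handle:
        pickle.dump(obj, handle)
    return path


def unpickle_me(path):
    """Load a pickled object, unzipping on the fly if the path ends in .gz."""
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rb') as handle:
        return pickle.load(handle)


basefile = os.path.join(tempfile.mkdtemp(), 'example')
saved = pickle_me({'info': {'Name': 'test'}, 'stat': {'Rank': 1}}, basefile)
restored = unpickle_me(saved)
print(saved.endswith('.pickle.gz'), restored)
```

Pickles should only be loaded from trusted sources, since unpickling can execute arbitrary code.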
RJE_Object_Shell.processPickle(self,newme)
    Changes attributes accordingly. Replace this method in subclasses.
RJE_Object_Shell.replaceMe(self,newme)
    Replaces all my attributes with those of newme.
RJE_Object_Shell.unpickleMe(self,basefile=None,process=True)
    (Unzips and) loads pickle object and returns.
RJE_Object_Shell.v(self)


RJE_Object_Shell.vPrint(self,text,v=1)


RJE_Object_Shell.verbose(self,v=0,i=None,text='',newline=1)
    Prints text to screen if verbosity high enough. Pauses program if interactivity high enough.
    >> v:int = verbosity cut-off for statement
    >> i:int = interactivity cut-off for pause
    >> text:str = text to be printed
    >> newline:int = no. of newlines to follow text. (Pause counts as 1 newline)
RJE_Object_Shell.yesNo(self,text='',default='Y',confirm=False,i=0)




rje Module Methods

rje.OLDbinomial(observed,trials,prob,exact=False,usepoisson=True,callobj=None)
    Returns the binomial probability of observed+ occurrences.
    >> observed:int = number of successes (k)
    >> trials:int = number of trials (n)
    >> prob:float = probability of success of each trial
    >> usepoisson:bool = whether to use Poisson if Binomial fails [True]
rje.backup(callobj,filename,unlink=True,appendable=True)
    Checks for existence of file and gives backup options.
    >> callobj:Object calling the method (use for interactive/append)
    >> filename:str = filename to backup
    >> unlink:boolean [True] = whether to delete file if found
rje.baseFile(filename,strip_path=False,extlist=[])
    Returns file without extension, with or without path.
    >> filename:str = file to reduce to basefile
    >> strip_path:bool = whether to strip any path information [False]
    >> extlist:list of acceptable file extensions to remove []
    << basefile:str = returned filename base
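    The behaviour described for baseFile() can be approximated with the standard library. This is a minimal
    Python 3 sketch, not the rje code: the name base_file_sketch is hypothetical, and it assumes that extlist
    entries are given without a leading dot and that an empty extlist means "remove any extension".

```python
import os

def base_file_sketch(filename, strip_path=False, extlist=()):
    """Return filename without its extension (hypothetical stand-in for rje.baseFile)."""
    if strip_path:
        filename = os.path.basename(filename)   # drop any directory part
    root, ext = os.path.splitext(filename)
    # Assumption: an empty extlist accepts every extension
    if not extlist or ext.lstrip('.') in extlist:
        return root
    return filename

print(base_file_sketch('data/seqs.fas', strip_path=True))   # seqs
```

    With extlist=['txt'], a file such as 'seqs.fas' would be returned unchanged in this sketch.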
rje.binComb(positions,cmin,cmax)
    Returns the number of binary combinations within certain ranges.
    >> positions:int = the number of positions that can be 1 or 0
    >> cmin:int = the min. no. of 1s to have
    >> cmax:int = the max. no. of 1s to have
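    One way to read the count returned by binComb() is as a sum of binomial coefficients. The sketch below is an
    illustration only (Python 3.8+, hypothetical name bin_comb_sketch) and assumes that both limits are inclusive.

```python
import math

def bin_comb_sketch(positions, cmin, cmax):
    """Count 0/1 strings of length `positions` holding between cmin and cmax 1s."""
    lo = max(cmin, 0)
    hi = min(cmax, positions)
    # Assumption: cmin and cmax are both inclusive
    return sum(math.comb(positions, k) for k in range(lo, hi + 1))

print(bin_comb_sketch(4, 1, 2))   # C(4,1) + C(4,2) = 4 + 6 = 10
```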
rje.binaryCount(binlist)
    Adds one in binary and returns binlist.
    >> binlist:list of 0s and 1s
rje.binomial(observed,trials,prob,exact=False,usepoisson=True,callobj=None)
    Returns the binomial probability of observed+ occurrences.
    >> observed:int = number of successes (k)
    >> trials:int = number of trials (n)
    >> prob:float = probability of success of each trial
    >> usepoisson:bool = whether to use Poisson if Binomial fails [True]
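    The "observed+" probability is the upper tail of the binomial distribution, P(X >= observed). A minimal
    Python 3.8+ sketch follows. It is not the rje implementation: binomial_tail_sketch is a hypothetical name,
    there is no Poisson fallback, and exact=True is assumed to mean "exactly observed successes".

```python
import math

def binomial_tail_sketch(observed, trials, prob, exact=False):
    """P(X >= observed), or P(X == observed) if exact, for X ~ Binomial(trials, prob)."""
    def pmf(k):
        return math.comb(trials, k) * prob ** k * (1.0 - prob) ** (trials - k)
    if exact:
        return pmf(observed)
    return sum(pmf(k) for k in range(max(observed, 0), trials + 1))

print(binomial_tail_sketch(1, 2, 0.5))   # 0.75
```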
rje.checkForFile(file)
    Returns True if file exists or False if not.
rje.choice(text='?: ',default='',confirm=False)
    Asks for a choice and returns input.
    >> default:str = default value given for blank entry ['']
rje.chomp(text)


rje.cleanDir(callobj=None,keepfiles=[],cleandir='',log=True)
    Cleanup directory by removing files not in list.
    >> callobj:Object = calling RJE Object
    >> keepfiles:list = Files to ignore and not delete
    >> cleandir:str [] = Directory to cleanup
rje.combineDict(targetdict,sourcedict,overwrite=True,replaceblanks=True,copyblanks=False)
    Adds data from sourcedict to targetdict (targetdict changes).
    >> targetdict:dictionary that will be altered
    >> sourcedict:dictionary containing data to add to targetdict
    >> overwrite:bool [True] = whether keys from sourcedict will overwrite same data in targetdict
    >> replaceblanks:bool [True] = whether to replace existing but empty targetdict entries
    >> copyblanks:bool [False] = whether to copy blank entries over existing target entries
rje.dataDict(callobj,filename,mainkeys=[],datakeys=[],delimit=None,headers=[],getheaders=False,ignore=[],lists=False,debug=False,enforce=False)
    Extracts data from delimited file into dictionary.
    >> callobj:RJE_Object controlling logs and error-handling
    >> filename:str = file to read from
    >> mainkeys:list = List of headers to be used as key for returned dictionary. If None, will use first header.
    >> datakeys = List of headers to be used as keys for data returned for each mainkey (all headers if [])
    >> delimit = string delimiter. If None, will identify from filename
    >> headers = List of headers to use instead of reading from first line or use datakeys
    >> getheaders:bool [False] = whether to add an extra 'Headers' dictionary value containing list of headers
    >> ignore:list = Leading strings for lines to ignore (e.g. #)
    >> lists:bool [False] = whether to return values as lists. (Otherwise, later entries will overwrite earlier ones)
    >> enforce:bool [False] = whether to enforce correct length of delimited lines [True] or [False] truncate/extend
rje.dateTime(t=(),yymmdd=False)
    Returns date time string given time tuple t (makes if empty).
rje.deleteDir(callobj,deldir,contentsonly=True,confirm=True,report=True)
    Deletes directory contents, including subdirectories. Use with care!!
    >> callobj:Object calling the method (use for interactive/append)
    >> deldir:str = path to directory to delete
    >> contentsonly:boolean [True] = whether to delete files within directory only or directory too
    >> confirm:boolean [True] = ask for confirmation before deleting files
    >> report:boolean [True] = whether to summarise deletion in log if callobj given
    << True if deleted, False if not, KeyboardInterrupt error if mind changed
rje.delimitExt(delimit)
    Returns file extension for text file with given delimiter.
rje.delimitFromExt(ext='',filename='',write=False)
    Returns default delimiter for given file extension.
rje.delimitedFileOutput(callobj,filename,headers,delimit=None,datadict={},rje_backup=False)
    Outputs a single line of compiled data to file.
    >> callobj:Object = calling object
    >> filename:str = Name of output file
    >> headers:list of field headers
    >> delimit:str = text delimiter. If None, will try to work out from filename and callobj
    >> datadict:dictionary of {header:str} = data to output. If none, will output headers themselves.
    >> rje_backup:boolean [False] = if no datadict and true, will call rje.backup(callobj,filename,unlink=True)
rje.delimitedObjDataOutput(callobj,filename,headers,delimit=None,dpdict={})
    Outputs object data as single delimited line.
    >> callobj:Object containing data in obj.dict['Data'], obj.stat, obj.opt and obj.info
    >> filename:str = Name of output file
    >> headers:list of field headers. Should correspond to obj.getData() keys
    >> delimit:str = text delimiter. If None, will try to work out from filename and callobj
    >> dpdict:dictionary of dp settings (integers) for obj.getData() and corresponding lists of headers. (All else -1)
rje.dictFreq(dict,total=True,newdict=False)
    Normalises values of dict by total. Adds 'Total' key if desired (total=True). Must be numeric values!
rje.dictKeysSortedByValues(dict,revsort=False)
    Returns dictionary keys sorted according to corresponding values.
rje.dictValues(dict,key,valtype='list')
    Returns dict values or empty list if dict does not have key.
rje.dp(data,dp)
    Returns number rounded to X dp.
rje.eStr(_expect,strict=True)


rje.entropyDict(data,ikeys=[],fillblanks=True)
    Calculate entropy of input dictionary.
    >> data:dict = input data dictionary {key:numeric}
    >> ikeys:list = keys for entropy calculation. Uses data.keys() if []
    >> fillblanks:bool [True] = whether to fill in zero values for missing keys
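    As an illustration of an entropy calculation over such a dictionary, the sketch below computes Shannon
    entropy in bits. The name entropy_dict_sketch, the log base (2) and the normalisation of values to
    frequencies are all assumptions; the rje code may differ on each point.

```python
import math

def entropy_dict_sketch(data, ikeys=(), fillblanks=True):
    """Shannon entropy (bits) of the numeric values in data, restricted to ikeys if given."""
    keys = list(ikeys) if ikeys else list(data)
    values = []
    for key in keys:
        if key in data:
            values.append(float(data[key]))
        elif fillblanks:
            values.append(0.0)          # missing keys count as zero
        else:
            raise KeyError(key)
    total = sum(values)
    if total <= 0:
        return 0.0
    # Zero values contribute nothing to the sum
    return -sum((v / total) * math.log2(v / total) for v in values if v > 0)

print(entropy_dict_sketch({'A': 1, 'B': 1, 'C': 2}))   # 1.5
```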
rje.exists(file)
    Returns True if file exists or False if not.
rje.expectString(_expect,strict=True)
    Returns formatted string for _expect value.
rje.factorial(m,callobj=None)
    Returns the factorial of the number m.
rje.fileLineFromSeek(FILE=None,pos=0,reseek=False,next=False)
    Returns full line & new pos for seek position pos.
    >> FILE:Open file object for reading
    >> pos:int [0] = position to be within line
    >> reseek:boolean [False] = seek to new position before returning
    >> next:boolean [False] = return next line, rather than line containing pos
    << returns (line(str),pos(long))
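    Recovering the full line that contains an arbitrary seek position can be sketched as below. This is a
    simplified illustration (hypothetical name file_line_from_seek_sketch) that assumes a file opened in binary
    mode with '\n' line endings, and it leaves out the reseek and next options.

```python
import io

def file_line_from_seek_sketch(handle, pos=0):
    """Return (line, position after line) for the line containing byte offset pos."""
    # Step back one byte at a time until the previous newline or the file start
    while pos > 0:
        handle.seek(pos - 1)
        if handle.read(1) == b'\n':
            break
        pos -= 1
    handle.seek(pos)
    line = handle.readline()
    return line, handle.tell()

demo = io.BytesIO(b'one\ntwo\nthree\n')
print(file_line_from_seek_sketch(demo, 5))   # (b'two\n', 8)
```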
rje.fileTransfer(fromfile=None,tofile=None,deletefrom=True,append=True)
    Appends fromfile to tofile and deletes fromfile.
    >> fromfile:str = name of file to be copied and deleted
    >> tofile:str = name of file to receive the contents of fromfile
    >> deletefrom:bool = whether fromfile to be deleted [True]
    >> append:bool = whether to append tofile [True]
rje.formula(callobj=None,formula='',data={},varlist=[],operators=[],check=False,calculate=True)
    Calculates formula using data dictionary, restricting to varlist and operators if desired. This calculation is
    executed in lower case, so varlist, data and formula need not match case. However, this obviously means that case-
    sensitive variables cannot be used.
    >> callobj:Object [None] = calling object, used for error messages if given.
    >> formula:str [''] = Formula as a string. Will be split on operators. Can have variables in data or numbers.
    >> data:dict {} = Dictionary of {variable:value}, where values should be numbers.
    >> varlist:list [] = List of data.keys() to include in calculation (in case some are strings etc.)
    >> operators:list [] = List of restricted operators. Will use ()+-*/^ if none given. (Other brackets replaced.)
    >> check:boolean [False] = Whether to check for any variables missing from data.keys()/varlist
    >> calculate:boolean [True] = Whether to calculate result of formula and return
    << value:float = results of calculation
rje.geoMean(numlist=[])
    Returns geometric mean of numbers in list.
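    A geometric mean is usually computed in log space to avoid overflow on long lists. A minimal sketch,
    assuming strictly positive numbers and returning 0.0 for an empty list (both assumptions, as is the name
    geo_mean_sketch):

```python
import math

def geo_mean_sketch(numlist):
    """Geometric mean of positive numbers, computed as exp(mean(log(x)))."""
    if not numlist:
        return 0.0   # assumption: empty input gives 0.0
    return math.exp(sum(math.log(x) for x in numlist) / len(numlist))

print(round(geo_mean_sketch([2, 8]), 6))   # 4.0
```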
rje.getBool(text='Numerical Value?:',default=False,confirm=False)
    Asks for a choice and returns boolean.
    >> text:str = Prompt Text
    >> default:bool = Default value
rje.getCmdList(argcmd,info=None)
    Converts arguments into list of commands, reading from ini file as appropriate.
    >> argcmd:list = list of commands from commandline separated by whitespace (typically sys.argv[1:])
    >> info:Info Object [None] = if given, will try to read defaults from 'info.program.ini'
    << cmd_list:list = list of commands!
rje.getDelimit(cmd_list=[],default='\t')
    Returns delimit from command list.
rje.getFileList(callobj=None,folder=os.getcwd(),filelist=['*'],subfolders=True,summary=True,filecount=0,asksub=False)
    Returns a list of files with appropriate filenames.
    >> callobj:RJE_Object = object used for verbosity etc. (if any) [None]
    >> folder:str = folder to start looking in (for os.listdir(folder)) [Current directory]
    >> filelist:list of files ['*']
    >> subfolders:bool = whether to also look in subfolders
    >> summary:bool [True] = whether to print output summary
    >> filecount:int [0] = running total of files so far for progress reporting
    >> asksub:bool [False] = whether to ask for confirmation before scanning subdir
    << globlist:list of strings = paths to files from current directory
rje.getFileName(text='File Name?',default='',mustexist=True,confirm=False)
    Asks for a filename as part of interactive menus etc.
    >> text:str = Text to print as prompt for file name
    >> default:str = Default file name
    >> mustexist:boolean = Whether the file must exist (ask again if not)
    >> confirm:boolean = Whether the user should confirm the new filename
    << filename:str = file name entered by user
rje.getFloat(text='Numerical Value?:',default='0.0',confirm=False)
    Asks for a choice and returns float.
    >> text:str = Prompt Text
    >> default:str = Default value as string
rje.getFromDict(dict,key,returnkey=True,case=True,default=None)
    Returns dict value if it has key, else given key (or None).
    >> dict:dictionary from which to get value
    >> key:dictionary key
    >> returnkey:bool [True] = whether to return the key itself if not found in dictionary
    >> case:bool [True] = whether to use case-sensitive key matching
    >> default [None] = what to return if no entry
rje.getInt(text='Integer Value?:',blank0=False,default='0',confirm=False)
    Asks for a choice and returns integer.
    >> text:str = Prompt Text
    >> default:str = Default value as string
rje.iLen(inlist)


rje.iStr(number)


rje.ignoreLine(fileline,ignorelist)
    Returns whether line should be ignored.
rje.iniCmds(ini_path,ini_file,iowarning=True)
    Reads a list of commands from an inifile
    >> ini_path:str = filepath. 
    >> ini_file:str = filename. Add ini_path if not found in current directory
    >> iowarning:boolean [True] = whether to warn and kill if *.ini missing.
    << inicmds:list = list of commands
rje.inputCmds(cmd_out,cmd_list)
    Reads extra commands from user prompt
    >> cmd_out:Out 
    >> cmd_list:list = current list of commands
    << newcmd:list = new list of commands
rje.integerString(number)
    Returns a string with commas for long integer, e.g. 1,000.
    >> number:integer to be returned as string
    << intstring
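    In current Python the same formatting is available through the ',' format specifier. A one-line sketch
    (hypothetical name integer_string_sketch, not the rje code):

```python
def integer_string_sketch(number):
    """Return an integer as a string with comma thousands separators."""
    return '{:,}'.format(int(number))

print(integer_string_sketch(1000))   # 1,000
```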
rje.isOdd(num)
    Returns True if Odd or False if Even.
rje.isYounger(file1,file2)
    Compares age of files. file2 should be desired new file. file1 is returned in the case of a tie.
    Returns:
    - younger file, or
    - None if either does not exist, or
    - First file if of same age
rje.listCombos(inlist)
    Returns a list of all possible combinations of inlist entries.
    E.g. [AB,D,EF] will return [ [A,D,E], [A,D,F], [B,D,E], [B,D,F] ]
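    The example above is a Cartesian product of the entries. A sketch using itertools.product (hypothetical
    name list_combos_sketch; the ordering of the real rje output is assumed to match):

```python
import itertools

def list_combos_sketch(inlist):
    """Return every combination that takes one element from each entry of inlist."""
    return [list(combo) for combo in itertools.product(*inlist)]

print(list_combos_sketch(['AB', 'D', 'EF']))
# [['A', 'D', 'E'], ['A', 'D', 'F'], ['B', 'D', 'E'], ['B', 'D', 'F']]
```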
rje.listDifference(list1,list2)


rje.listFromCommand(command,checkfile=True)
    Returns a list object from a given command string. Reads list from file if found. If not, will split on commas.
    >> command:string
    >> checkfile:boolean = whether to check for presence of file and read list from it
    << comlist:list
rje.listIntersect(list1,list2)


rje.listRearrange(inlist)
    Returns all possible orders of letters as list.
rje.listUnion(list1,list2)


rje.logBinomial(observed,trials,prob,exact=False,callobj=None)
    Returns the binomial probability of observed+ occurrences.
    >> observed:int = number of successes (k)
    >> trials:int = number of trials (n)
    >> prob:float = probability of success of each trial
    >> usepoisson:bool = whether to use Poisson if Binomial fails [True]
rje.logFactorial(m,callobj=None)
    Returns the natural log of the factorial of the number m.
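    Working with log-factorials keeps Poisson and binomial terms within floating-point range for large counts.
    The sketch below uses math.lgamma as a stand-in for a log-factorial and applies it to an exact Poisson
    term. Both function names are hypothetical, and expected is assumed to be greater than 0.

```python
import math

def log_factorial_sketch(m):
    """ln(m!) via the log-gamma function, since gamma(m + 1) == m!."""
    return math.lgamma(m + 1)

def poisson_exact_sketch(observed, expected):
    """Exact Poisson probability of `observed`, evaluated in log space (expected > 0)."""
    return math.exp(math.log(expected) * observed - expected - log_factorial_sketch(observed))

print(round(poisson_exact_sketch(2, 3.0), 6))   # 0.224042
```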
rje.logPoisson(observed,expected,exact=False,callobj=None)
    Returns the Poisson probability of observed+ occurrences given expected, or of exactly observed occurrences
    if exact=True. Calculated in log space using logFactorial().
    >> observed:int = observed number of occurrences
    >> expected:num = expected number of occurrences (a negative value is reset to 0, with a warning if callobj given)
    >> exact:bool [False] = whether to return the exact probability of observed rather than the cumulative observed+
    >> callobj:Object [None] = calling object, used for error messages if given
    << prob:float = probability. Returns 1.0 if observed <= 0 (unless exact=True and expected > 0), 0.0 if expected
       is 0 and observed > 0, and 0.0 if the exact calculation fails or the cumulative result falls below zero.
rje.longCmd(cmd_list)
    Extracts long command from command list and returns altered command list.
    >> cmd_list:list of commandline arguments
rje.makePath(path='',wholepath=False,return_blank=True)
    Returns path that can be used for calling programs etc.
    >> path:str = Given path with directory separators as '/'
    >> wholepath:boolean = whether path includes the program call (True) or not (False)
    >> return_blank:boolean [True] = whether to return '' if given or replace with '.'
    << os_path:str = Returned path with appropriate separators
rje.matchExp(re_pattern, text)
    Returns matched groups or None if no match.
    >> re_pattern:str = regular expression
    >> text:str = string to match
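    A sketch of the "groups or None" pattern with the standard re module. The name match_exp_sketch is
    hypothetical, and the use of re.search (rather than re.match) is an assumption:

```python
import re

def match_exp_sketch(re_pattern, text):
    """Return the tuple of matched groups, or None if the pattern does not match."""
    found = re.search(re_pattern, text)
    return found.groups() if found else None

print(match_exp_sketch(r'^(\S+)=(\d+)', 'key=42'))   # ('key', '42')
```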
rje.meansd(numlist)
    Returns (mean,standard deviation) for a list of numbers.
rje.meanse(numlist)
    Returns (mean,standard error) for a list of numbers.
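    For reference, the two statistics are related by SE = SD / sqrt(n). The sketch below assumes the sample
    standard deviation (n - 1 denominator); whether rje uses the sample or population form is not stated here.
    Function names are hypothetical.

```python
import math

def mean_sd_sketch(numlist):
    """Return (mean, sample standard deviation) for a non-empty list of numbers."""
    n = len(numlist)
    mean = sum(numlist) / float(n)
    # Assumption: sample variance (n - 1); a single value gives SD 0.0
    var = sum((x - mean) ** 2 for x in numlist) / (n - 1) if n > 1 else 0.0
    return mean, math.sqrt(var)

def mean_se_sketch(numlist):
    """Return (mean, standard error of the mean)."""
    mean, sd = mean_sd_sketch(numlist)
    return mean, sd / math.sqrt(len(numlist))

print(mean_sd_sketch([1, 2, 3]))   # (2.0, 1.0)
```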
rje.mkDir(callobj,newdir,log=False)
    Makes directory and necessary parent directories.
rje.modulus(num)
    Returns modulus of number.
rje.nextLine(FILEHANDLE=None,strip=True)
    Returns next line or None if end of file.
    >> FILEHANDLE:File object
    >> strip:boolean = whether to strip \r and \n from line [True]
rje.objFactorial(callobj,m)
    Returns the factorial of the number m.
rje.perCount(num,max,steps,prints)
    Returns string to print (if any).
    >> num:int = current number
    >> max:int = max number
    >> steps:int = number of counts at which to print '.'
    >> prints:int = number of counts at which to print 'X.X%'
rje.perCounter(countlist)
    Counter for functions. Adds one to 
    >> countlist = list of counter parameters:
    .. (sloop,ploop,total,max,steps,prints,verbosity)
        > sloop:int = current number of step loop
        > ploop:int = current number of print loop
        > total:int = total count
        > max:int = max number
        > steps:int = steps at which to print '.'
        > prints:int = steps at which to print 'X%'
        > verbosity:int = verbosity level of object
    Setup: perc = [0,0,0,max,steps,prints,verbosity], e.g. [0,0,0,x,100,1000,verbosity]
    .. use rje.setPerc(max,steps,prints,verbosity)
    << countlist
rje.poisson(observed,expected,exact=False,callobj=None,uselog=True)
    Returns the Poisson probability of observed+ occurrences given expected, or of exactly observed occurrences
    if exact=True.
    >> observed:int = observed number of occurrences
    >> expected:num = expected number of occurrences
    >> exact:bool [False] = whether to return the exact probability of observed rather than the cumulative observed+
    >> callobj:Object [None] = calling object, used for error messages if given
    >> uselog:bool [True] = whether to try logPoisson() first, falling back on direct calculation if it fails
    << prob:float = probability (0.0 if the exact calculation fails or the cumulative result falls below zero)
rje.posFromIndex(target,INDEX,start_pos=0,end_pos=-1,re_index='^(\S+)=',sortunique=False,xreplace=True)
    Returns position in file from which to extract target using fileLineFromSeek(INDEX,pos).
    NB. Index should be sorted and every line from start_pos on should match re_index.
    >> target:str = target string from INDEX using re_index.
    >> INDEX:filehandle open for reading.
    >> start_pos:int [0] = position in file to start looking
    >> end_pos:int [-1] = position at end of file (seek(0,2)->tell())
    >> re_index:str ['^(\S+)='] = regular expression to use to identify match to target
    >> sortunique:boolean [False] = whether to attempt to match the UNIX sort as best possible (dodgy)
    >> xreplace:boolean [True] = whether to replace dots with xs for determining sort 
    << returns ipos for use with fileLineFromSeek, else -1 if not found.
rje.preZero(num,max)
    Adds leading zeros to integer values for output of equal length
    >> num:int = number to be extended
    >> max:int = maximum number in set (defines 'length' of output number)
    << prezero:str = string of num with leading zeros
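    The padding described above can be reproduced with str.zfill. A minimal sketch (hypothetical name
    pre_zero_sketch), assuming non-negative integers:

```python
def pre_zero_sketch(num, max_num):
    """Pad num with leading zeros to the width of max_num."""
    return str(num).zfill(len(str(max_num)))

print(pre_zero_sketch(7, 100))   # 007
```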
rje.progressPrint(callobj,x,dotx=1000,numx=10000,v=0)
    Prints a dot or a number to screen dependent on numbers.
    >> callobj:RJE_Object = object controlling verbosity
    >> x:int = current count
    >> dotx:int = count for a printed dot
    >> numx:int = count for a printed number
    >> v:int = verbosity level for output
rje.randomList(inlist)
    Returns inlist in randomised order.
    >> inlist:List object
    << ranlist:randomised order list
rje.randomString(length)
    Returns a random string of given length.
    >> length:int = length of string to return
rje.rankDict(data,rev=False,absolute=False,lowest=False)
    Returns rank of values as new dictionary.
    >> data:dict = input dictionary of scores.
    >> rev:Boolean = if True will return 0 for Highest
    >> absolute:boolean [False] = return 1 to n, rather than 0 to 1
    >> lowest:boolean [False] = returns lowest rank rather than mean rank in case of ties
    << rankdict:dictionary of {key:rank} (0 = Lowest, 1 = Highest)
rje.rankList(scorelist=[],rev=False,absolute=False,lowest=False,unique=False)
    Returns rank of scores as list.
    >> scorelist = list of scores
    >> rev:Boolean = if True will return 0 for Highest
    >> absolute:boolean [False] = return 1 to n, rather than 0 to 1
    >> lowest:boolean [False] = returns lowest rank rather than mean rank in case of ties
    >> unique:boolean [False] = give each element a unique rank (ties rank in order of entry)
    << ranklist:list of ranks (0 = Lowest, 1 = Highest)
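    The ranking options can be illustrated with the sketch below. It is not the rje code: rank_list_sketch is
    a hypothetical name, the unique option is omitted, and the 0 to 1 scaling is assumed to be
    (rank - 1) / (n - 1).

```python
def rank_list_sketch(scorelist, rev=False, absolute=False, lowest=False):
    """Rank scores, giving tied scores their mean (or lowest) rank."""
    n = len(scorelist)
    ordered = sorted(scorelist, reverse=rev)
    ranks = {}
    for score in set(scorelist):
        first = ordered.index(score) + 1      # 1-based rank of the first tied entry
        ties = ordered.count(score)
        ranks[score] = first if lowest else first + (ties - 1) / 2.0
    if absolute:
        return [ranks[s] for s in scorelist]  # 1 to n
    if n < 2:
        return [0.0 for s in scorelist]
    # Assumption: linear rescaling of 1..n onto 0..1
    return [(ranks[s] - 1) / float(n - 1) for s in scorelist]

print(rank_list_sketch([10, 30, 20]))   # [0.0, 1.0, 0.5]
```

    With absolute=True, the tied input [5, 5, 9] gives [1.5, 1.5, 3.0] in this sketch, or [1, 1, 3] if lowest=True.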
rje.readDelimit(line='',delimit='\t')
    Reads a line into a list of strings.
    >> line:string object line from input file 
    >> delimit:text limiter for file [tab]
    << returns list of strings
rje.regExp(re_object, text)
    Returns matched groups.
    >> re_object:re Object = re.compiled regular expression
    >> text:str = string to match
rje.scaledict(dict={},scale=1.0)
    Scales all values by scale and returns new dictionary.
rje.setLog(info,out,cmd_list)
    Makes Log Object and outputs general program run info to start of file. Returns log.
    >> info:rje.Info object containing program info
    >> out:rje.Out object controlling output to screen
    >> cmd_list:list = full list of commandline options
    << log:Log object
rje.setPerc(max,steps,prints,verbosity)
    Set up countlist for perCounter.
    >> max:int = max number
    >> steps:int = steps at which to print '.'
    >> prints:int = steps at which to print 'X%'
    >> verbosity:int = verbosity level of object
rje.sortKeys(dic,revsort=False)
    Returns sorted keys of dictionary as list.
    >> dic:dictionary object
    >> revsort:boolean = whether to reverse list before returning
rje.sortListsByLen(listoflists,rev=False)


rje.sortUnique(inlist,xreplace=True,num=False)
    Returns sorted unique list.
    >> inlist:List to be sorted
    >> xreplace:boolean [True] = whether to replace dots with xs for determining sort
    >> num:boolean [False] = whether the sortlist is numbers rather than strings
rje.strList(text)
    Returns string as list.
rje.strRearrange(text)
    Returns all possible orders of letters as list.
rje.strReplace(strtext,deltext,newtext='',case_sens=True,allocc=False,max=1)
    Replaces first (or all) occurrences of deltext within strtext with newtext. Note that this does not use regular
    expressions and looks for exact matches only. Use re.sub() for regular expression replacements.
    >> strtext:str = text in which replacements to take place
    >> deltext:str = text to be replaced
    >> newtext:str = replacement text ['']
    >> case_sens:bool = whether to replace in case-sensitive manner [True]
    >> allocc:bool = whether to replace all occurrences [False]
    >> max:int = maximum number of replacement (if allocc=False) [1]
    << returns a tuple of (strtext,count), where strtext has replacements made and count is number of replacements.
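    A sketch of exact-match replacement that returns both the new string and the replacement count. It is
    illustrative only: str_replace_sketch is a hypothetical name, the max argument is renamed max_count to
    avoid shadowing the built-in, and case-insensitive matching assumes ASCII text.

```python
def str_replace_sketch(strtext, deltext, newtext='', case_sens=True, allocc=False, max_count=1):
    """Replace up to max_count (or all) exact occurrences; return (new text, count)."""
    if not deltext:
        return (strtext, 0)
    haystack = strtext if case_sens else strtext.lower()
    needle = deltext if case_sens else deltext.lower()
    limit = None if allocc else max_count
    parts, pos, count = [], 0, 0
    while limit is None or count < limit:
        hit = haystack.find(needle, pos)
        if hit < 0:
            break
        parts.append(strtext[pos:hit])   # keep original-case text before the hit
        parts.append(newtext)
        pos = hit + len(deltext)
        count += 1
    parts.append(strtext[pos:])
    return (''.join(parts), count)

print(str_replace_sketch('banana', 'an', 'AN'))   # ('bANana', 1)
```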
rje.strReverse(text)
    Returns reversed string.
rje.strSort(text,unique=False)
    Returns sorted string.
rje.strSub(instr,start,end,sub)
    Replaces part of string and returns. Start and end are inclusive.
    >> instr:str = full string
    >> start:int
    >> end:int
    >> sub:str = substitution
rje.stringStrip(instr,striplist)
    Returns string with striplist strings removed.
rje.subDir(pathname,exclude=[])
    Returns the subdirectories given by glob.glob()
    >> pathname:pathname for glob.glob()
    >> exclude:list of directories to leave out of list
rje.valueSortedKeys(data,rev=False)
    Returns list of keys, sorted by values.
rje.writeDelimit(OUTFILE=None,outlist=[],delimit='\t',outfile=None)
    Writes given list of strings to file (if given) or just returns string.
    >> OUTFILE:file handle = output file handle [None]
    >> outlist:list of string objects to write to file []
    >> delimit:text limiter for file [tab]
    >> outfile:string = name of file to append (use if OUTFILE not given)
    << returns the output string
rje.yesNo(text='',default='Y',confirm=False)
    Asks for yes or no and returns True or False.

rje Module ToDo Wishlist

    # [Y] : Split general functions into groups, like delimited text functions

rje_aaprop [version 0.1] AA Property Matrix Module ~ [Top]

Module: rje_aaprop
Description: AA Property Matrix Module
Version: 0.1
Last Edit: 18/05/06
Imports: rje
Imported By: badasp, presto_V5, peptide_dismatrix, peptide_stats, rje_motiflist, rje_slimcalc
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
Takes an amino acid property matrix file and reads it into an AAPropMatrix object. Converts it into an all-by-all
property difference matrix. By default, gaps and Xs will be given null properties (None) unless part of input file.

Commandline:
* aaprop=FILE : Amino Acid property matrix file. [aaprop.txt]
* aagapdif=X : Property difference given to amino acid vs gap comparisons [5]
* aanulldif=X : Property difference given to amino acid vs null values (e.g. X) [0.5]

Uses general modules: re, string, sys, time
Uses RJE modules: rje

AAPropMatrix Class

    Amino Acid Property Matrix Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Name of file from which data is loaded
    
    Opt:boolean

    Stat:numeric
    - GapDif = Property difference given to amino acid vs gap comparisons [5]
    - NullDif = Property difference given to amino acid vs null values (e.g. X) [0.5]

    Obj:RJE_Objects
AAPropMatrix._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
AAPropMatrix._setAttributes(self)
    Sets Attributes of Object:
    - Info:str ['Name']
    - Stats:float ['GapDif','NullDif']
    - Opt:boolean []
    - Obj:RJE_Object []
AAPropMatrix.makePropDif(self)
    Converts the property matrix into a property difference matrix.
AAPropMatrix.readAAProp(self,filename=None)
    Reads AA Property Matrix from file.
    >> filename:str = Filename. If None, will use self.info['Name']
AAPropMatrix.saveAAProp(self,filename=None)
    Saves AA Property Matrix to file.
    >> filename:str = Filename. If None, will use self.info['Name']
AAPropMatrix.savePropDif(self,filename='aapdif.txt')
    Saves AA Property Difference Matrix to file.
    >> filename:str = Filename. If None, will use self.info['Name']
AAPropMatrix.useAlphabet(self,alphabet,missing=None,trim=False)
    Makes sure matrix is using supplied alphabet (no missing values to cause errors).
    >> alphabet:list = single-letter codes to use.
    >> missing:num = values to give properties currently missing a given letter.
    >> trim:boolean = whether to delete parts of the property matrix that are not in given alphabet

rje_aaprop Module Methods

rje_aaprop.runMain()




rje_ancseq [version 1.2] Ancestral Sequence Prediction Module ~ [Top]

Module: rje_ancseq
Description: Ancestral Sequence Prediction Module
Version: 1.2
Last Edit: 08/01/07
Imports: rje, rje_pam
Imported By: badasp, gasp, haqesac, rje_tree
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
This module contains the objects and methods for ancestral sequence prediction. Currently, only GASP (Edwards & Shields
2004) is implemented. Other methods may be incorporated in the future.

GASP Commandline:
* fixpam=X : PAM distance fixed to X [0].
* rarecut=X : Rare aa cut-off [0.05].
* fixup=T/F : Fix AAs on way up (keep probabilities) [True].
* fixdown=T/F : Fix AAs on initial pass down tree [False].
* ordered=T/F : Order ancestral sequence output by node number [False].
* pamtree=T/F : Calculate and output ancestral tree with PAM distances [True].
* desconly=T/F : Limits ancestral AAs to those found in descendants [True].
* xpass=X : How many extra passes to make down & up tree after initial GASP [1].

Classes:
Gasp(log=None,cmd_list=[],tree=None,ancfile='gasp'):
- Handles main GASP algorithm.
>> log:Log = rje.Log object
>> cmd_list:List = List of commandline variables
>> tree:Tree = rje_tree.Tree Object
>> ancfile:str = output filename (basefile)
GaspNode(realnode,alphabet,log):
- Used by Gasp Class to handle specific node data during GASP.
>> realnode:Node Object (rje_tree.py)
>> alphabet:list of amino acids for use in GASP
>> log:Log Object

Uses general modules: copy, sys, time
Uses RJE modules: rje, rje_pam

Gasp Class

    GASP Class. Author: Rich Edwards (2005).
    GASP: Gapped Ancestral Sequence Prediction.
    
    Info:str
    - Name = Name (basefile)
    
    Opt:boolean
    - FixUp [True] = Fix AAs on way up (keep probabilities)
    - FixDown [False] = Fix AAs on initial pass down tree
    - Ordered [False] = Order ancestral sequence output by node number
    - PamTree [True] = Whether to output *.nsf & *.txt trees
    - DescOnly [False] = Limits ancestral AAs to those found in descendants
    - RST [False] = Produces RST-style AA probability text

    Stat:numeric
    - FixPam [0] = PAM distance fixed to X
    - RareCut [0.5] = Rare aa cut-off
    - XPass [1] =  How many extra passes to make down & up tree after initial GASP.  
    
    Obj:RJE_Objects
    - Tree:rje_tree.Tree Object (Uses SeqList and PAM objects of Tree)
Gasp.__init__(self,log=None,cmd_list=[],tree=None,ancfile='gasp')
    RJE_Object:
    > log:Log = rje.Log object
    > cmd_list:List = List of commandline variables
    > tree:Tree = rje_tree.Tree Object
    > ancfile:str = output filename (basefile)

    On initiation, this object:
    - sets the Log object
    - calls the _setAttributes() method to setup class attributes
    - calls the _cmdList() method to process relevant Commandline Parameters       
Gasp._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Gasp._gapStatus(self)
    Predicts ancestral gap status.
Gasp._gaspProbs(self,aalist,useanc=False,dir='down',aaprobs=True,aasub=False,aafix=False,gpass=0)
    Work through tree calculating AA probs.
    >> aalist:list = alphabet
    >> useanc:boolean [False] = whether to consider predicted ancestor
    >> dir:str ['down'] = direction to move through tree ('up','down','rootonly')
    >> aaprobs:boolean [True] = whether to calculate AA probabilities
    >> aasub:boolean [False] = whether to change sequence to most likely AA
    >> aafix:boolean [False] = whether to fix most likely AA as probability 1.0
    >> gpass:int [0] = Extra pass (for log output only)
Gasp._setAttributes(self)
    Sets Attributes of Object:
    - Info:str ['Name']
    - Stats:float ['FixPam','RareCut','XPass']
    - Opt:boolean ['FixUp','FixDown','Ordered','PamTree','DescOnly','RST']
    - Obj:RJE_Object ['Tree']
Gasp.details(self)
    Returns object details as text.
Gasp.gasp(self)
    Performs GASP: Gapped Ancestral Sequence Prediction.

GaspNode Class

    Class for use by GASP only.
    >> node:rje_tree.Node   = 'real' node associated with GaspNode
    >> alphabet:list =[]    = list of letters in sequence - matches PAM matrix
    >> log:rje.Log  = Log object for error messages
    ancaap:list     = list of dictionaries [r]{a}, probability of amino acid a @ residue r
    ancgap:list     = [r], probability of gap @ residue r
    ancfix:list     = [r], boolean whether residue fixed in descendants
    rst:list        = [r], string containing RST-style probability information
GaspNode.__init__(self,realnode,alphabet,log)
    Used by Gasp Class to handle specific node data during GASP.
    >> realnode:Node Object (rje_tree.py)
    >> alphabet:list of amino acids for use in GASP
    >> log:Log Object
GaspNode._buildAncLists(self)
    Builds initial ancaap and ancgap lists.
GaspNode._errorLog(self,errortxt)
    Tries to report error if log exists, else prints.
GaspNode.adjustAncAAP(self,rarecut,desconly=False,desc=[])
    Adjusts AA frequencies according to rarecut.
    >> rarecut:float = Frequency cut-off
    >> desconly:boolean [False] = whether to limit probabilities to AAs found in descendants.
    >> desc:list of descendant terminal nodes
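The rarecut adjustment described for adjustAncAAP can be illustrated with a minimal sketch. This is not the GaspNode code: the function name, the dictionary layout and the rescaling step (surviving probabilities rescaled to sum to 1) are all assumptions made for illustration.

```python
def adjust_rare(aaprob, rarecut, allowed=None):
    """Zero probabilities below rarecut (or outside `allowed`), then rescale.

    aaprob  : dict {amino acid: probability} for one residue (hypothetical layout)
    rarecut : float frequency cut-off
    allowed : optional set of amino acids found in descendants (cf. desconly)
    """
    kept = {}
    for aa, p in aaprob.items():
        if p < rarecut or (allowed is not None and aa not in allowed):
            kept[aa] = 0.0
        else:
            kept[aa] = p
    total = sum(kept.values())
    if total == 0.0:          # Nothing survived the cut-off: return input unchanged
        return dict(aaprob)
    return {aa: p / total for aa, p in kept.items()}

probs = adjust_rare({'A': 0.6, 'G': 0.36, 'W': 0.04}, rarecut=0.05)
```

Here the rare 'W' is dropped and the remaining probabilities are rescaled.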
GaspNode.makeRST(self,rstseq)
    Makes RST text using sequence list and own probabilities and populates self.rst.
    >> rstseq:list = Terminal node sequences
GaspNode.probFromSeq(self)
    Makes probability matrices based on own sequence (1 or 0).

rje_ancseq Module Methods

rje_ancseq.runMain()




rje_blast [version 1.11] BLAST Control Module ~ [Top]

Module: rje_blast
Description: BLAST Control Module
Version: 1.11
Last Edit: 31/10/11
Imports: rje
Imported By: aphid, budapest, fiesta, gablam, gfessa, gopher_V2, haqesac, picsi, presto_V5, qslimfinder, seqmapper, slimfinder, slimsearch, rje_dbase, rje_phos, rje_seqgen, rje_yeast, RankByDistribution, ned_SLiMPrints, ned_SLiMPrints_Tester, ned_rankbydistribution, rje_hmm, rje_motif_cons, rje_motif_stats, rje_motiflist, rje_seq, rje_slimcore, slimprints
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
Performs BLAST searches and loads results into objects. Performs GABLAM conversion of local alignments into global
alignment statistics.

Objects:
BLASTRun = Full BLAST run
BLASTSearch = Information for a single Query search within a BLASTRun
BLASTHit = Detailed Information for a single Query-Hit pair within BLASTRun
PWAln = Detailed Information for each aligned section of a Query-Hit Pair

Commandline:
* blastpath=X : path for blast files [c:/bioware/blast/] *Use fwd slashes (*Cerberus is special!)

* blastp=X : BLAST program (BLAST -p X) [blastp]
* blasti=FILE : Input file (BLAST -i FILE) [None]
* blastd=FILE : BLAST database (BLAST -d FILE) [None]
* formatdb=T/F : Whether to (re)format BLAST database [False]
* blasto=FILE : Output file (BLAST -o FILE) [*.blast]

* blaste=X : E-Value cut-off for BLAST searches (BLAST -e X) [1e-4]
* blastv=X : Number of one-line hits per query (BLAST -v X) [500]
* blastb=X : Number of hit alignments per query (BLAST -b X) [250]

* blastf=T/F : Complexity Filter (BLAST -F X) [True]
* blastcf=T/F : Use BLAST Composition-based statistics (BLAST -C X) [False]
* blastg=T/F : Gapped BLAST (BLAST -g X) [True]

* blasta=X : Number of processors to use (BLAST -a X) [1]
* blastopt=FILE : File containing raw BLAST options (applied after all others) []
* ignoredate=T/F : Ignore date stamps when deciding whether to regenerate files [False]

* gablamfrag=X : Length of gaps between mapped residue for fragmenting local hits [100]
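The options above name the BLAST flag each one maps to. The sketch below shows how such options might be assembled into a single command string. It is an illustration only, not the code in BLASTRun.blast(): the function name is hypothetical, and the executable name `blastall` is an assumption based on the legacy NCBI BLAST flags listed.

```python
def blast_command(opts, blastpath=''):
    """Build a legacy BLAST command string from rje_blast-style options."""
    def tf(value):
        return 'T' if value else 'F'      # Boolean flags are passed as T/F
    args = [
        blastpath + 'blastall',           # Assumed executable name
        '-p', opts.get('blastp', 'blastp'),
        '-i', opts['blasti'],
        '-d', opts['blastd'],
        '-o', opts['blasto'],
        '-e', str(opts.get('blaste', 1e-4)),
        '-v', str(opts.get('blastv', 500)),
        '-b', str(opts.get('blastb', 250)),
        '-F', tf(opts.get('blastf', True)),
        '-C', tf(opts.get('blastcf', False)),
        '-g', tf(opts.get('blastg', True)),
        '-a', str(opts.get('blasta', 1)),
    ]
    return ' '.join(args)

cmd = blast_command({'blasti': 'query.fas', 'blastd': 'db.fas', 'blasto': 'query.blast'})
```

Defaults in the sketch follow the bracketed defaults listed above; the file names are placeholders.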

Uses general modules: os, re, string, sys, time
Uses RJE modules: rje

rje_blast Module Version History

    # 0.0 - Initial Working Compilation.
    # 0.1 - No Out Object in Objects
    # 1.0 - Corrected to work with blastn (and blastp)
    # 1.1 - Added special calling for Cerberus
    # 1.2 - Added GABLAM and GABLAMO to BlastHit
    # 1.3 - Added GABLAM calculation upon reading BLAST results and clearing Alignment sequences to save memory
    # 1.4 - Tidied up the module with improved logging and progress reporting. Added dbCleanup.
    # 1.5 - Added checking for multiple hits with same name and modified BLAST_Run.hitToSeq()
    # 1.6 - Added nucleotide vs protein searches to GABLAM
    # 1.7 - Added nucleotide vs nucleotide searches to GABLAM
    # 1.8 - Added local alignment summary output to ReadBLAST()
    # 1.9 - Added BLAST -C
    # 1.10- Added BLAST -g
    # 1.11- Added gablamfrag=X : Length of gaps between mapped residue for fragmenting local hits [100]

BLASTHit Class

    BLAST Hit Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Short Name of Hit Sequence (First Word of description line)
    
    Opt:boolean
    - GABLAM = Whether GABLAM search has been performed

    Stat:numeric
    - BitScore
    - E-Value
    - Length
    - GablamFrag = Length of gaps between mapped residue for fragmenting local hits [100]

    List:list

    Dict:Dictionary
    - GABLAM = GABLAM and GABLAMO stats
    - Local = Local alignment dictionary {alnID:{stats}}

    Obj:RJE_Objects

    Other:
    - aln:list of PWAln Objects
BLASTHit._addAln(self)
    Adds and returns a PWAln Object.
BLASTHit._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
BLASTHit._setAttributes(self)


BLASTHit.alnNum(self)


BLASTHit.globalFromLocal(self,qrylen,keepaln=False)
    Returns a dictionary of global alignment stats from Query-Hit local alignments.
    >> qrylen:int = length of query sequence
    >> keepaln:bool [False] = Whether to store GABLAM alignments in Hit object
    << {Query:{},Hit:{}}, where each value is a dictionary of [GABLAM(O) ID, GABLAM(O) Sim, GABLAM(O) Len]
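The idea of deriving global statistics from local alignments can be sketched as follows. This is a simplified illustration, not the BLASTHit.globalFromLocal() implementation: it covers only the query side and the unordered case, assumes alignments are supplied best first, and uses hypothetical dictionary keys.

```python
def global_from_local(qrylen, local_alns):
    """Collapse local alignments into query coverage and identity counts.

    local_alns: list of dicts with 'QryStart' (1-based) plus 'QrySeq' and
    'SbjSeq' (aligned strings of equal length, '-' for gaps), best first.
    """
    status = [None] * qrylen              # None = unaligned, else 'id'/'mismatch'
    for aln in local_alns:
        pos = aln['QryStart'] - 1         # 0-based index into the query
        for q, s in zip(aln['QrySeq'], aln['SbjSeq']):
            if q == '-':                  # Gap in query: no query residue consumed
                continue
            if status[pos] is None and s != '-':   # First alignment to cover wins
                status[pos] = 'id' if q == s else 'mismatch'
            pos += 1
    aligned = sum(1 for x in status if x is not None)
    return {'GABLAM Len': aligned, 'GABLAM ID': status.count('id')}

stats = global_from_local(10, [
    {'QryStart': 2, 'QrySeq': 'ACD-E', 'SbjSeq': 'ACEFE'},
    {'QryStart': 4, 'QrySeq': 'DEF', 'SbjSeq': 'DEF'},
])
```

Each query residue is counted once, even where local alignments overlap.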
BLASTHit.makeLocalDict(self)
    Generates summary Local alignment dictionary.

BLASTRun Class

    BLASTRun Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Output File Name (BLAST -o X)
    - Type = Program (blastp etc.) (BLAST -p X)
    - DBase = Database to search (BLAST -d X)
    - InFile = Input file name (BLAST -i X)
    - OptionFile = file containing a string of BLAST options to append to commandline
    - BLAST Path = path to blast programs
    - BLASTCmd = system command used to generate BLAST in self.blast()
    - BLASTOpt = Additional BLAST options
    
    Opt:boolean
    - Composition Statistics
    - Complexity Filter = (BLAST -F) [True]
    - FormatDB = whether to (re)format database before blasting
    - GappedBLAST = Gapped BLAST (BLAST -g X) [True]
    - IgnoreDate = Ignore date stamps when deciding whether to regenerate files [False]

    Stat:numeric
    - E-Value = e-value cut-off (BLAST -e X) [1e-4]
    - OneLine = Number of one-line hits per query (BLAST -v X) [500]
    - HitAln  = Number of hit alignments per query (BLAST -b X) [250]
    - DBLen = Length of Database (letters)
    - DBNum = Number of Sequences in Database
    - BLASTa = Number of processors to use (BLAST -a X) [1]

    List:list

    Dict:dictionary    

    Obj:RJE_Objects

    Other:
    search:list = list of BLASTSearch Objects
BLASTRun._addSearch(self)
    Adds and returns a new search object.
BLASTRun._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
BLASTRun._setAttributes(self)
    Sets Attributes of Object.
BLASTRun.blast(self,wait=True,type=None,cleandb=False,use_existing=False,log=False)
    Performs BLAST using object attributes.
    >> wait:boolean  = whether to wait for BLAST. [True]
    >> type:str = type of BLAST search [None]
    >> cleandb:bool = whether to cleanup (delete) searchDB files after search [False]
    >> use_existing:bool = if True, will check for existing result and use if newer than files
    >> log:bool = Whether to log BLAST run
BLASTRun.checkBLAST(self,resfile=None,logcheck=True)
    Checks that each BLAST started has an end, thus identifying BLAST runs that have been terminated prematurely and
    need to be re-run.
    >> resfile:str = Results File (set as self.info['Name'])
    >> logcheck:boolean = Whether to print findings to log [True]
    << True if OK, False if not.
BLASTRun.formatDB(self,fasfile=None,protein=True,force=True,log=True,checkage=None)
    BLAST formats database given.
    >> fasfile:str = Name of file to form database [None]
    >> protein:boolean = whether protein [True]
    >> force:boolean = whether to overwrite an existing formatted DB [True]
BLASTRun.hitNum(self)


BLASTRun.hitToSeq(self,seqlist,searchlist=[],filename=None,appendfile=False)
    Saves hits from given searches to sequence object and, if given, a file.
    >> seqlist:rje_seq.SeqList Object *Necessary!* 
    >> searchlist:list of BLASTSearch objects [self.search if none]
    >> filename:str = Name of fasta output file - no save if None [None]
    >> appendfile:bool = Whether to append file
    << returns dictionary of {Hit:Sequence}
BLASTRun.readBLAST(self,resfile=None,clear=False,gablam=False,unlink=False,local=False,screen=True,log=False)
    Reads BLAST Results into objects.
    >> resfile:str = Results File (set as self.info['Name'])
    >> clear:Boolean = whether to clear current searches (True) or just append (False) [False]
    >> gablam:Boolean = whether to calculate gablam statistics and clear alignments to save memory [False]
    >> unlink:Boolean = whether to delete BLAST results file after reading [False]
    >> local:Boolean = whether to store Local alignment dictionary with basic alignment data [False]
    >> screen:Bool [True] = whether to output reading details to screen
    >> log:Bool [False] = whether to output reading details to log
    << returns True if (apparently) read to completion OK, else False
BLASTRun.readNextBLASTSearch(self,fpos=0,resfile=None,gablam=False,local=False)
    Reads BLAST Results into objects.
    >> resfile:str = Results File 
    >> gablam:Boolean = whether to calculate gablam statistics and clear alignments to save memory [False]
    >> local:Boolean = whether to store Local alignment dictionary with basic alignment data [False]
    >> fpos:Integer = position in results file to start reading from.
    << returns tuple: (Search object or None if no more results to be read,fpos).
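Reading one search at a time from a file position, as readNextBLASTSearch describes, might look like the sketch below. It is illustrative only: the function name is hypothetical, and treating each line starting with 'Query=' as the start of a new record is an assumption about the results format.

```python
import io

def read_next_record(handle, fpos=0, marker='Query='):
    """Return (record lines, new fpos); record is None when nothing is left."""
    handle.seek(fpos)
    record = None
    while True:
        start = handle.tell()             # Position before reading this line
        line = handle.readline()
        if not line:                      # End of file
            return record, start
        if line.startswith(marker):
            if record is not None:        # Next record begins: stop before it
                return record, start
            record = []
        if record is not None:
            record.append(line.rstrip('\n'))

handle = io.StringIO('header\nQuery= A\nhit1\nQuery= B\nhit2\n')
rec1, fpos = read_next_record(handle, 0)
rec2, fpos = read_next_record(handle, fpos)
rec3, fpos = read_next_record(handle, fpos)
```

Returning the position alongside the record lets a caller process very large results files without holding every search in memory.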
BLASTRun.saveCutBLAST(self,outfile=None)
    Saves a cutdown version of current BLAST (no alignments), which has enough lines to be successfully read in by
    the BLASTRun class. (Obviously, no GABLAM statistics can be calculated!)
    >> outfile:str = outfile to use if different from self.info['Name']
BLASTRun.searchNum(self)


BLASTRun.searchSeq(self,seqlist,proglog=True,inverse=False)
    Returns dictionary of searches as sequences from seqlist (or None if missing).
    >> seqlist:SeqList object
    >> proglog:bool [True] = whether to log progress 
    >> inverse:bool [False] = whether to reverse dictionary to sequence:search
    << {Search Object: Sequence Object}

BLASTSearch Class

    BLAST Search Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Short Name of Search Query (First word of description line)
    
    Opt:boolean

    Stat:numeric
    - Length = Sequence Length

    List:list

    Dict:dictionary    

    Obj:RJE_Objects

    Other:
    hit : list of BLASTHit Objects
BLASTSearch._addHit(self)
    Adds and returns a new hit object.
BLASTSearch._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
BLASTSearch._setAttributes(self)


BLASTSearch.checkHitNames(self)
    Checks for multiple hits of same name.
BLASTSearch.gablam(self,keepaln=False)
    Performs GABLAM analysis on all hits (unless done) and clears alignments.
    >> keepaln:bool [False] = Whether to store GABLAM alignments in Hit object
BLASTSearch.hitNum(self)


BLASTSearch.hitSeq(self,seqlist,proglog=True,inverse=False)
    Returns dictionary of hits as sequences from seqlist (or None if missing).
    >> seqlist:SeqList object
    >> proglog:bool [True] = whether to log progress 
    >> inverse:bool [False] = whether to reverse dictionary to sequence:hit
    << {Hit Object: Sequence Object}
BLASTSearch.saveHitIDs(self,outfile=None)
    Saves IDs of hits in file (e.g. for later fastacmd extraction).
    >> outfile:str = outfile to use.

PWAln Class

    Pairwise Aligned Sequence (Global/Local) Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Name
    - QrySeq = Text of aligned portion (Query only)
    - SbjSeq = Text of aligned portion (Subject only)
    - AlnSeq = Text of aligned portion (Alignment line)
    
    Opt:boolean

    Stat:numeric
    - BitScore
    - Expect
    - Length (integer value)
    - Identity (integer value)
    - Positives (integer value)
    - QryStart (1-N)
    - QryEnd
    - SbjStart (1-N)
    - SbjEnd

    List:list

    Dict:dictionary    

    Obj:RJE_Objects
PWAln._clear(self)
    Clears all data - dodgy alignment.
PWAln._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
PWAln._setAttributes(self)




rje_blast Module Methods

rje_blast.checkForDB(dbfile=None,checkage=True,log=None,protein=True)
    Checks for BLASTDB files and returns True or False as appropriate.
    >> dbfile:str = sequence file forming basis of database
    >> checkage:Boolean = also check that blastdb files are newer than dbfile
    >> log:Log Object
    >> protein:boolean = whether database is protein
rje_blast.cleanupDB(callobj=None,dbfile=None,deletesource=False)
    Deletes files created by formatdb.
    >> callobj:Object = object calling method
    >> dbfile:str = sequence file forming basis of database
    >> deletesource:boolean = whether to delete dbfile as well
rje_blast.expectString(_expect)
    Returns formatted string for _expect value.
rje_blast.formatDB(fasfile,blastpath,protein=True,log=None)
    Formats a blastDB.
    >> fasfile:str = file to format
    >> blastpath:str = path to BLAST programs
    >> protein:boolean = whether protein sequences
    >> log:Log Object
rje_blast.runMain()




rje_blast Module ToDo Wishlist

    # [ ] : Output of alignment with options for line lengths and numbers +-
    # [ ] : OrderAln by any stat ['BitScore','Expect','Length','Identity','Positives','QryStart','QryEnd','SbjStart','SbjEnd']
    # [Y] : Optionfile and commandline options blast-e=X etc.
    # [ ] : Test with other blast programs
    # [ ] : Add standalone running for blast searching?
    # [ ] : Replace blast.search with list['Search'] etc?
    # [ ] : Add documentation for implementation details - each method?
    # [ ] : Locate which classes/methods call BLASTRun.hitToSeq and look to improve reporting etc.
    # [ ] : Fix DNA implementation of GABLAM to allow Ordered GABLAM in either direction.
    # [ ] : Add positional information to GABLAM dictionary - start and end of aligned portions
    # [ ] : Add orientation of Query and Hit for DNA GABLAM
    # [ ] : Check/fix the database format checking of DNA databases

rje_dismatrix [version 2.6] Distance Matrix Module ~ (rje_dismatrix_V2.py)[Top]

Module: rje_dismatrix
Description: Distance Matrix Module
Version: 2.6
Last Edit: 05/12/11
Imports: rje, rje_zen
Imported By: aphid, budapest, gablam, gfessa, gopher_V2, happi, pingu, qslimfinder, slimfinder, slimsearch, rje_biogrid, rje_hprd, rje_seq, rje_slimcore, rje_slimhtml, slimjim
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
DisMatrix Class. Stores distance matrix data and contains methods for extra calculations, such as MST. If pylab is
installed, a distance matrix can also be turned into a heatmap.

Commandline:
* loadmatrix=FILE : Loads a matrix from FILE [None]
* symmetric=T/F : Whether the matrix should be symmetrical (e.g. DisAB = DisBA) [False]
* outmatrix=X : Type for output matrix - text / mysql / phylip / png
* nsf2nwk=T/F : Whether to convert extension for Newick Standard Format from nsf to nwk (for MEGA) [False]

Uses general modules: copy, os, re, string, sys, time
Uses RJE modules: rje
Other modules needed: None

rje_dismatrix_V2 Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Working version used in several programs.
    # 2.0 - Updated module inline with newer modules and layout. Incorporation of Minimum Spanning Tree.
    # 2.1 - Attempted to incorporate pylab heatmap generation from distance matrix.
    # 2.2 - Added loading of matrix from GABLAM-style database file.
    # 2.3 - Added UPGMA method.
    # 2.4 - Added UPGMA branch length return.
    # 2.5 - Added PNG output based on rje_slimhtml.
    # 2.6 - Added nsf2nwk=T/F - Whether to convert extension for Newick Standard Format from nsf to nwk (for MEGA)

DisMatrix Class

    Sequence Distance Matrix Class. Author: Rich Edwards (2007). This class can handle *positive* distances only!

    Info:str
    - Name = Name of matrix
    - Type = Identifying type (e.g. 'PWAln ID'). Should be same as key in SeqList.obj
    - Description = Extra information if desired
    - OutMatrix = Type for output matrix - text / mysql / phylip
    
    Opt:boolean
    - nsf2nwk = Whether to convert extension for Newick Standard Format from nsf to nwk (for MEGA) [False]
    - Symmetric = Whether Seq1->Seq2 = Seq2->Seq1

    Stat:numeric

    List:list

    Dict:dictionary
    - Matrix = dictionary of Object pairs and their distance {Obj1:{Obj2:Dis}}

    Obj:RJE_Objects
DisMatrix.MST(self,objkeys=[],normalisation=1.0)
    Calculate MST size for listed objects in matrix. Adapted from Norman Davey's SlimDiscApp.
    >> objkeys:list = list of object keys for MST calculation
    >> normalisation:float = MST weighting factor (similarities raised to this power)
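A minimum spanning tree size over a {Obj1:{Obj2:Dis}} matrix can be computed with Prim's algorithm, as in the sketch below. This is not the DisMatrix.MST() code: the function name and the `missing` default are assumptions, and the normalisation weighting is deliberately left out.

```python
def mst_size(matrix, objkeys, missing=1.0):
    """Total edge length of a minimum spanning tree over objkeys (Prim's algorithm)."""
    if len(objkeys) < 2:
        return 0.0

    def dis(a, b):
        # Look up either direction; fall back on `missing` if no distance stored
        return matrix.get(a, {}).get(b, matrix.get(b, {}).get(a, missing))

    # best[k] = shortest known distance from the growing tree to object k
    best = {k: dis(objkeys[0], k) for k in objkeys[1:]}
    total = 0.0
    while best:
        nxt = min(best, key=best.get)     # Closest object not yet in the tree
        total += best.pop(nxt)
        for k in best:
            best[k] = min(best[k], dis(nxt, k))
    return total

size = mst_size({'A': {'B': 0.1, 'C': 0.5}, 'B': {'C': 0.2}}, ['A', 'B', 'C'])
```

In this example the tree uses the A-B and B-C edges.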
DisMatrix._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
DisMatrix._setAttributes(self)
    Sets Attributes of Object.
DisMatrix.addDis(self,obj1,obj2,dis)
    Adds a distance to matrix.
    >> obj1 and obj2:key Objects
    >> dis:Float = 'Distance' measure
DisMatrix.checkSymmetry(self,force=False)
    Checks symmetry of matrix and forces if desired.
DisMatrix.cluster(self,maxdis=0,singletons=True)
    Returns a list of cluster objects.
    >> maxdis:num [0] = Value representing maximum distance. (Will read from matrix if 0)
    >> singletons:bool [True] = Whether to include singleton clusters (True), combine into single cluster (False), or remove (None).
DisMatrix.findEdges(self,objkeys,obj)
    Not sure why but it returns a reversed list of all other objects.
DisMatrix.forceSymmetry(self,method='min',missing=-1)
    Compresses an Asymmetrical matrix into a symmetrical one.
    >> method:str = method for combining distances: min/max/mean
    >> missing:float = Value to be given to missing distances. If < 0 will ignore these values.
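The min/max/mean options for forceSymmetry can be illustrated as below. This is a sketch under stated assumptions, not the module's code: the function name is hypothetical, it returns a new dictionary rather than modifying the object, and self-distances are simply copied across.

```python
def force_symmetry(matrix, method='min', missing=-1):
    """Return a symmetrical copy of an {obj1: {obj2: dis}} matrix."""
    combine = {'min': min, 'max': max,
               'mean': lambda values: sum(values) / len(values)}[method]
    keys = set(matrix)
    for row in matrix.values():
        keys.update(row)
    sym = {key: {} for key in keys}
    for a in keys:
        for b in keys:
            if a == b:                       # Keep any self-distance unchanged
                if a in matrix.get(a, {}):
                    sym[a][a] = matrix[a][a]
                continue
            values = []
            for x, y in ((a, b), (b, a)):
                if y in matrix.get(x, {}):
                    values.append(matrix[x][y])
                elif missing >= 0:           # Negative `missing`: ignore the gap
                    values.append(missing)
            if values:
                sym[a][b] = combine(values)
    return sym

sym = force_symmetry({'A': {'B': 0.2}, 'B': {'A': 0.4}}, method='mean')
```

With a negative `missing`, a distance recorded in only one direction is copied to the other.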
DisMatrix.getDis(self,obj1,obj2,default=None)
    Returns distance from matrix or None if comparison not made.
    >> obj1 and obj2: Objects
    >> default:anything = distance returned if comparison not made.
    << Float = 'Distance' measure
DisMatrix.loadFromDataTable(self,filename=None,delimit=None,clear=True,key1='Qry',key2='Hit',distance='Qry_OrderedAlnID',normalise=100.0,inverse=True,checksym=False)
    Loads distance matrix from e.g. GABLAM output. (Defaults are for GABLAM)
    >> filename:str = Input file name. Will use self.info['Name'] if None.
    >> delimit:str = Text delimiter. Will ascertain from filename if None.
    >> clear:bool = Whether to clear the current matrix before loading the new one.
    >> key1:str ['Qry'] = column identifying obj1 key for matrix
    >> key2:str ['Hit'] = column identifying obj2 key for matrix
    >> distance:str ['Qry_OrderedAlnID'] = column storing distance measure to read
    >> normalise:float [100.0] = normalise distances to 0-1 scale by dividing by this number (if non-zero)
    >> inverse:bool [True] = whether to subtract value from 1 to get distance (reading a similarity measure)
    >> checksym:bool [False] = Whether to check symmetry of matrix. Will enforce and/or update self.opt['Symmetry']
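The normalise and inverse parameters convert a percentage similarity into a 0-1 distance. A minimal sketch, assuming a comma-delimited table and using a hypothetical function name rather than the module's own loader:

```python
import csv
import io

def load_distances(handle, key1='Qry', key2='Hit', distance='Qry_OrderedAlnID',
                   normalise=100.0, inverse=True, delimit=','):
    """Read a delimited table into an {obj1: {obj2: dis}} dictionary."""
    matrix = {}
    for row in csv.DictReader(handle, delimiter=delimit):
        value = float(row[distance])
        if normalise:                     # Scale to 0-1 (skipped if zero)
            value /= normalise
        if inverse:                       # Similarity -> distance
            value = 1.0 - value
        matrix.setdefault(row[key1], {})[row[key2]] = value
    return matrix

table = io.StringIO('Qry,Hit,Qry_OrderedAlnID\nA,B,75.0\nB,A,50.0\n')
matrix = load_distances(table)
```

A similarity of 75.0 therefore becomes a distance of 0.25.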
DisMatrix.loadMatrix(self,filename=None,delimit=None,clear=True,usecase=False,checksym=True,default=0.0)
    Loads distance matrix from a delimited file. The first column should contain unique identifiers, each of which
    should also have a column heading of its own. If self.opt['Symmetry'] is True then symmetry will be enforced.
    >> filename:str = Input file name. Will use self.info['Name'] if None.
    >> delimit:str = Text delimiter. Will ascertain from filename if None.
    >> clear:bool = Whether to clear the current matrix before loading the new one.
    >> usecase:bool = Enforces matching of case for rows and columns. Else will match any case column to rows.
    >> checksym:bool = Whether to check symmetry of matrix. Will enforce and/or update self.opt['Symmetry']
    >> default:float = Default values to be given to empty cells.
DisMatrix.maxDis(self)
    Returns maximum distance in matrix.
DisMatrix.minDis(self,missing=1.0)
    Returns minimum distance in matrix.
    >> missing:float [1.0] = Value to be assigned missing distances (and starting min dis)
DisMatrix.minDisPair(self,missing=1.0)
    Returns minimum distance in matrix.
    >> missing:float [1.0] = Value to be assigned missing distances (and starting min dis)
    >> returnpair:bool [False] = whether to return pair of objects with mindis rather than distance
DisMatrix.objName(self,obj)
    Returns object name.
DisMatrix.objNum(self)


DisMatrix.remove(self,obj)
    Removes object from self.dict['Matrix'].
DisMatrix.rename(self,newnames={},missing='keep')
    Goes through matrix and renames objects using given dictionary.
    >> newnames:dict = mapping of existing names to new names
    >> missing:str = treatment of names missing as keys (keep/delete)
DisMatrix.saveCytoscape(self,basename=None,type='dis',cutoff=1.0,inverse=False,selflink=False)
    Outputs matrix in Cytoscape format.
    >> basename:str [None] = Basic name for files. Will use rje.baseFile(self.info['Name']) if missing.
    >> type:str ['dis'] = Type of interaction for *.sif file.
    >> cutoff:float [1.0] = Distances >= this will not be output as links
    >> inverse:bool [False] = If True, distances below cutoff will not be output
    >> selflink:bool [False] = whether to output self distances
DisMatrix.saveMatrix(self,objkeys=None,filename='matrix.txt',delimit=',',format=None,log=True,default=1.0)
    Saves matrix in file. Uses None for missing values.
    >> objkeys:list of Objects that form keys of matrix in output order
    >> filename:str = output file name
    >> delimit:str = separator between columns
    >> format:str = 'text' [Default], 'mysql' = lower case header, 'phylip' = phylip
    >> log:Boolean = whether to print report to log [True]
    >> default:anything = distance to use when no distance present in dictionary
DisMatrix.savePNG(self,tree,basefile=None,nsftree=None,savensf=True,bycluster=0,singletons=True)
    Saves Matrix as heatmap and tree to file using R code.
    >> tree:rje_tree.Tree object = Needs to be given to method to avoid circularity
    >> basefile:str [None] = Name for output file. Will use basefile by default.
    >> nsftree:str [None] = NSF Tree for output. Will make from distance matrix using UPGMA if None.
    >> bycluster:int [0] = Number of sequences to split up according to clustering
    >> singletons:bool [True] = Whether to include singletons in main output
DisMatrix.sortMinQueue(self,min_queue,high_scores)
    Returns min_queue list ordered by high_score (smallest to highest).
DisMatrix.sortObj(self)


DisMatrix.treeName(self,name,nospace=False)
    Reformats name to be OK in NSF file.
DisMatrix.upgma(self,sym='mean',nosim=1.0,log=True,objkeys=[],returnlen=False,checksym=True)
    Generates a UPGMA tree (NSF) from distance matrix.
    >> sym:str ['mean'] = Symmetry method to employ
    >> nosim:float [1.0] = Distance to be given for missing distances (no similarity)
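For readers unfamiliar with UPGMA, the sketch below builds a Newick-style tree string from a symmetrical distance dictionary. It is a textbook version under stated assumptions and not the DisMatrix.upgma() implementation: the function name, branch length formatting and tie-breaking are all illustrative, and at least one object is assumed to be present.

```python
def upgma_newick(matrix, nosim=1.0):
    """Build a Newick string by UPGMA from a symmetrical {a: {b: dis}} matrix."""
    def dis(a, b):
        return matrix.get(a, {}).get(b, matrix.get(b, {}).get(a, nosim))

    names = sorted(set(matrix) | {k for row in matrix.values() for k in row})
    # Each cluster: id -> (newick text, number of leaves, node height)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {(i, j): dis(names[i], names[j])
         for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        (ti, ni, hi), (tj, nj, hj) = clusters.pop(i), clusters.pop(j)
        height = dij / 2.0                # New node sits halfway between clusters
        text = '(%s:%.3f,%s:%.3f)' % (ti, height - hi, tj, height - hj)
        # Distances to the merged cluster are averages weighted by cluster size
        new_d = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = (text, ni + nj, height)
        next_id += 1
    return list(clusters.values())[0][0] + ';'

tree = upgma_newick({'A': {'B': 0.2, 'C': 0.6}, 'B': {'C': 0.6}})
```

Missing distances are filled with `nosim`, mirroring the parameter described above.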

rje_dismatrix_V2 Module Methods

rje_dismatrix_V2.runMain()


rje_dismatrix_V2.sortQueue(min_queue,high_scores)
    Returns min_queue list ordered by high_score (smallest to highest).

rje_dismatrix_V2 Module ToDo Wishlist

    # [ ] : Possibly add a mergeCase() method to combine mixed case versions of same keys.
    # [ ] : Finish heat map implementation at some point.
    # [ ] : Consider implementing simple NJ method. (To learn how!)
    # [ ] : Add XGMML output
    # [ ] : Check and replace old rje_dismatrix in all modules
    # [ ] : Check/Add NSF output from loaded dismatrix
    # [ ] : Upgrade to new rje_obj class structure.
    # [ ] : General improvement of docstrings etc.
    # [ ] : Add manual details for standalone functionality.

rje_disorder [version 0.7] Disorder Prediction Module ~ [Top]

Module: rje_disorder
Description: Disorder Prediction Module
Version: 0.7
Last Edit: 08/09/10
Imports: rje, rje_zen
Imported By: presto_V5, rje_seqplot, sfmap2png, slim_pickings, ned_disorderHelper, rje_motif_stats, rje_motiflist, rje_sequence, rje_slimcalc
Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

Function:
This module currently has limited function and no standalone capability, though this may be added with time. It is
designed for use with other modules. The disorder Class can be given a sequence and will run the appropriate
disorder prediction software and store disorder prediction results for use in other programs. The sequence will have
any gaps removed.

Currently four disorder prediction methods are implemented:
* IUPred : Dosztanyi Z, Csizmok V, Tompa P & Simon I (2005). J. Mol. Biol. 347, 827-839. This has to be installed
locally. It is available on request from the IUPred website and any use of results should cite the method. (See
http://iupred.enzim.hu/index.html for more details.) IUPred returns a value for each residue, which by default,
is determined to be disordered if > 0.5.
* FoldIndex : This is run directly from the website (http://bioportal.weizmann.ac.il/fldbin/findex) and more simply
returns a list of disordered regions. You must have a live web connection to use this method!
* ANCHOR : Meszaros B, Simon I & Dosztanyi Z (2009). PLoS Comput Biol 5(5): e1000376. This has to be installed
locally. It is available on request from the ANCHOR website and any use of results should cite the method. (See
http://anchor.enzim.hu/ for more details.) ANCHOR returns a probability value for each residue, which by default,
is determined to be disordered if > 0.5.
* Parse: Parsed disorder from protein sequence name, e.g. DisProt download.
#X-Y = disordered region [1.0]; &X-Y = ordered region [0.0]

For IUPred, the individual residue results are stored in Disorder.list['ResidueDisorder']. For both methods, the
disordered regions are stored in Disorder.list['RegionDisorder'] as (start,stop) tuples.
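The relationship between per-residue scores and (start,stop) region tuples can be sketched as follows. The function name is hypothetical and the code is an illustration rather than the Disorder class itself; the default cut-off follows the iucut default listed under Commandline.

```python
def disordered_regions(scores, cutoff=0.2):
    """Return 1-based (start, stop) tuples for runs of residues with score > cutoff."""
    regions = []
    start = None
    for i, score in enumerate(scores, 1):
        if score > cutoff:
            if start is None:             # Open a new disordered region
                start = i
        elif start is not None:           # Region ended at the previous residue
            regions.append((start, i - 1))
            start = None
    if start is not None:                 # Region runs to the end of the sequence
        regions.append((start, len(scores)))
    return regions

regions = disordered_regions([0.1, 0.3, 0.4, 0.1, 0.5])
```

Positions are numbered from 1 to L, matching the RegionDisorder description.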

Commandline:
### General Options ###
* disorder=X : Disorder method to use (iupred/foldindex/anchor/parse) [iupred]
* iucut=X : Cut-off for IUPred/ANCHOR results [0.2]
* iumethod=X : IUPred method to use (long/short) [short]
* sequence=X : Sequence to predict disorder for (autorun) []
* name=X : Name of sequence to predict disorder for []
* minregion=X : Minimum length of an ordered/disordered region [0]

### System Settings ###
* iupath=PATH : The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
* anchor=PATH : Full path to ANCHOR executable []
* filoop=X : Number of times to try connecting to FoldIndex server [10]
* fisleep=X : Number of seconds to sleep between attempts [2]
* iuchdir=T/F : Whether to change to IUPred directory and run (True) or rely on IUPred_PATH env variable [False]
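The filoop and fisleep settings describe a retry loop. A generic sketch of that pattern is given below; it does not contact the FoldIndex server, and the function name and the use of IOError as the failure signal are assumptions.

```python
import time

def with_retries(fetch, loops=10, sleep=2):
    """Call fetch() up to `loops` times, sleeping `sleep` seconds between failures."""
    for attempt in range(loops):
        try:
            return fetch()
        except IOError:
            if attempt + 1 < loops:       # No need to sleep after the final attempt
                time.sleep(sleep)
    return None                           # All attempts failed
```

A caller would pass a function that performs the web request and raises IOError on failure.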

Uses general modules: copy, os, string, sys, time, urllib2
Uses RJE modules: rje
Other modules needed: None

rje_disorder Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Added parsing of disorder from name as an option instead of disorder prediction
    # 0.2 - Added Folded tuple as well as disordered
    # 0.3 - Added PrintLog opt attribute
    # 0.4 - Added option for correct use of IUPred_PATH environment variable
    # 0.5 - Added Minimum length of an ordered/disordered region
    # 0.6 - Added ANCHOR prediction.
    # 0.7 - Added globProportion calculation.

Disorder Class

    Disorder Prediction Class. Author: Rich Edwards (2005).

    Info:str
    - ANCHOR = Full path to ANCHOR executable
    - Sequence = sequence to give to disorder prediction (will have gaps removed)
    - Disorder = Disorder method to use (iupred/foldindex) [None]
    - IUPath = The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
    - IUMethod = IUPred method to use (long/short) [short]
    
    Opt:boolean
    - Flat = whether ResidueDisorder is a "flat" 1/0 or graded (e.g. raw IUPred)
    - IUChDir = Whether to change to IUPred directory and run (True) or rely on IUPred_PATH env variable [False]
    - PrintLog = whether to print disorder prediction status to Log [False]

    Stat:numeric
    - IUCut = Cut-off for IUPred results [0.2]
    - FILoop = Number of times to try connecting to FoldIndex server [10]
    - FISleep = Number of seconds to sleep between attempts [2]
    - MinRegion = Minimum length of an ordered/disordered region [0]
    
    List:list
    - ResidueDisorder = individual (IUPred) residue results 
    - RegionDisorder = disordered regions as (start,stop) tuples (1->L)
    - RegionFold = folded (i.e. not disordered regions) as (start,stop) tuples (1->L)
    
    Dict:dictionary    

    Obj:RJE_Objects
Disorder.ANCHOR(self,retry=2)
    Runs ANCHOR disorder prediction.
Disorder._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Disorder._setAttributes(self)
    Sets Attributes of Object.
Disorder.disorder(self,sequence='',name='')
    Takes a sequence, degaps, and runs disorder prediction.
    >> sequence:str = protein sequence
    >> name:str = (optional) name for sequence - goes in self.info['Name']
Disorder.flatten(self)
    Converts ResidueDisorder into "flat" 0.0 or 1.0.
Disorder.foldIndex(self)
    Runs FoldIndex disorder prediction.
Disorder.globProportion(self,absolute=False)
    Returns the proportion that is globular.
Disorder.iuPred(self,retry=2)
    Runs IUPred disorder prediction.
Disorder.minRegion(self)
    Reduces self.list['RegionDisorder']/self.list['RegionFold'] using stat['MinRegion'].
Disorder.parseDisorder(self)
    Parses disordered regions from sequence name (e.g. DisProt download).
    #X-Y = disordered region [1.0]; &X-Y = ordered region [0.0]; All else neutral [0.5];
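The #X-Y and &X-Y notation can be parsed with a regular expression, as in this sketch. It is illustrative only: the function name and the example sequence name are hypothetical, and the real method may handle the name differently.

```python
import re

def parse_disorder(name, seqlen):
    """Return per-residue scores: 1.0 for #X-Y, 0.0 for &X-Y, 0.5 elsewhere."""
    scores = [0.5] * seqlen               # Neutral unless stated otherwise
    for symbol, start, stop in re.findall(r'([#&])(\d+)-(\d+)', name):
        value = 1.0 if symbol == '#' else 0.0
        first = max(1, int(start))        # Clamp to the sequence (1-based positions)
        last = min(int(stop), seqlen)
        for pos in range(first, last + 1):
            scores[pos - 1] = value
    return scores

scores = parse_disorder('DP00001 #1-3 &5-6', 7)
```

Residues not covered by either symbol keep the neutral score of 0.5.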
Disorder.summary(self)
    Returns a string summary of the disorder.

rje_disorder Module Methods

rje_disorder.runMain()




rje_disorder Module ToDo Wishlist

    # [Y] : Add region buffer - min length of region allowed before ignoring
    # [Y] : Add a PrintLog option that controls whether printing to Log or not.
    # [ ] : Neaten and tidy
    # [ ] : Add domain-based disorder stuff?

rje_hmm [version 1.3] HMMer Control Module ~ [Top]

Module: rje_hmm
Description: HMMer Control Module
Version: 1.3
Last Edit: 25/11/08
Imports: rje, rje_blast, rje_zen
Imported By: unifake, rje_ensembl
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
This module is designed to perform basic HMM functions using the HMMer program. Currently, there are three functions
that may be performed, separately or consecutively:
* 1. Use hmmbuild to construct HMMs from input sequence files
* 2. Search a sequence database with HMM files
* 3. Convert HMMer output into a delimited text file of results.
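The first two steps above can be sketched as a list of commands to run. This is an assumption-laden illustration, not the HMMRun code: the function name and the output file naming are hypothetical, and the argument order (`hmmbuild hmmfile alignfile`, `hmmcalibrate hmmfile`, `hmmsearch hmmfile seqfile`) is taken to follow HMMER 2 usage. The commands are only built, not executed.

```python
def hmmer_commands(seqfile, searchdb, hmmerpath='', calibrate=True):
    """Return argument lists for building, calibrating and searching with an HMM."""
    hmmfile = seqfile.rsplit('.', 1)[0] + '.hmm'   # Hypothetical naming scheme
    commands = [[hmmerpath + 'hmmbuild', hmmfile, seqfile]]
    if calibrate:                                   # cf. hmmcalibrate=T/F
        commands.append([hmmerpath + 'hmmcalibrate', hmmfile])
    commands.append([hmmerpath + 'hmmsearch', hmmfile, searchdb])
    return commands

commands = hmmer_commands('family.aln', 'proteome.fas')
```

Each list could be passed to a process runner; step 3, conversion of the search output to a delimited table, is not covered here.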

Commandline:
## Build Options ##
* makehmm=LIST : Sequence file(s). Can include wildcards [None]
* hmmcalibrate=T/F : Whether to calibrate HMM files once made [True]

## Search Options ##
* hmm=LIST : HMM file(s). Can include wildcards. [*.hmm]
* searchdb=FILE : Fasta file to search with HMMs [None]
* hmmoptions=LIST : List or file of additional HMMer search options (joined by whitespace) []
* hmmpfam=T/F : Performs standard HMMer PFam search (--cut_ga) (or processes if present) [False]
* hmmout=FILE : Pipe results of HMM searches into FILE [None]
* hmmres=LIST : List of HMM search results files to convert (wildcards allowed) []
* hmmtab=FILE : Delimited table of results ('None' to skip) [searchdb.tdt]
* cleanres=T/F : Option to reduce size of HMM results file by removing no-hit sequences [True]

## System Parameters ##
* hmmerpath=PATH : Path for hmmer files [/home/richard/Bioware/hmmer-2.3.2/src/]
* force=T/F : Whether to force regeneration of new HMMer results if already existing [False]
* gzip=T/F : Whether to gzip (and gunzip) HMMer results files (not Windows) [True]

Classes:
HMMRun Object = Full HMM run
HMMSearch Object = Information for a single Query search within an HMMRun
HMMHit Object = Detailed Information for a single Query-Hit pair within HMMRun
rje_blast.PWAln Object = Detailed Information for each aligned section of a Query-Hit Pair

Uses general modules: glob, os, re, string, sys, time
Uses RJE modules: rje, rje_blast

rje_hmm Module Version History

    # 0.0 - Initial Working Compilation.
    # 1.0 - Working version with multiple HMM capacity
    # 1.1 - Added hmmpfam option
    # 1.2 - Cleaned up and debugged for rje_ensembl.ensDat()

HMMHit Class

    HMM Hit Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Short Name of Hit Sequence (First Word of description line)
    
    Opt:boolean

    Stat:numeric
    - BitScore
    - E-Value
    - Length

    Obj:RJE_Objects

    Other:
    - aln:list of PWAln Objects
HMMHit._addAln(self)
    Adds and returns a PWAln Object.
HMMHit._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
HMMHit._setAttributes(self)
    Sets Attributes of Object:
    - Info:str ['Name','Type','Description']
    - Stats:float ['BitScore','E-Value','Length']
    - Opt:boolean []
    - Obj:RJE_Object []
HMMHit.alnNum(self)




HMMRun Class

    HMMRun Class. Author: Rich Edwards (2005).

    Info:str
    - SearchDB = Fasta file to search with HMMs [None]
    - HMMTab = Delimited table of results ('None' to skip) [searchdb.hmmer.tdt]
    - HMMOut = Pipe results of HMM searches into FILE [None]
    - HMMerPath = path for hmmer files [c:/bioware/hmmer/] *Use fwd slashes

    Opt:boolean
    - CleanRes = Option to reduce size of HMM results file by removing no-hit sequences [True]
    - GZip = Whether to gzip (and gunzip) HMMer results files (not Windows) [True]
    - HMMCalibrate = Whether to calibrate HMM files once made [True]
    - HMMPFam = Performs standard HMMer PFam search (--cut_ga) (or processes if present) [False]

    Stat:numeric

    List:lists    
    - MakeHMM = Sequence file(s). Can include wildcards [None]
    - HMM = HMM file(s). Can include wildcards. [*.hmm]
    - HMMOptions = List or file of additional HMMer search options (joined by whitespace) []
    - HMMRes = List of HMM search results files to convert []
    
    Obj:RJE_Objects
    
    Other:
    - search:list = list of HMMSearch Objects (Kept for BLAST consistency)
HMMRun._addSearch(self)
    Adds and returns a new search object.
HMMRun._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
HMMRun._run(self)
    Controls main Class functions:
    * 1. Use hmmbuild to construct HMMs from input sequence files
    * 2. Search a sequence database with HMM files
    * 3. Convert HMMer output into a delimited text file of results.
HMMRun._setAttributes(self)
    Sets Attributes of Object
HMMRun.buildHMM(self,seqfile,hmmfile=None)
    Makes an HMM from a sequence alignment file.
    >> seqfile:str = Name of sequence file
    >> hmmfile:str = Name of HMM file [*.hmm]
    << hmmfile if made, None if failed.
HMMRun.hmmSearch(self,hmm,dbase=None,outfile=None,wait=True)
    Performs HMMer Search using object attributes.
    >> hmm:str = Name of HMM file 
    >> dbase:str = Name of DBase file [self.info['SearchDB']]
    >> outfile:str = Name of Output file [self.info['HMMOut']]
    >> wait:boolean  = whether to wait for HMMer. [True]
    << returns outfile or None if fails
HMMRun.hmmTable(self,outfile='',append=False,delimit=None)
    Outputs results table.
    >> outfile:str = Name of output file
    >> append:boolean = whether to append file
    >> delimit:str = Delimiter to use [\t]
HMMRun.readHMMPFamSearch(self,resfile=None,readaln=False)
    Reads HMM Search Results into objects.
    >> resfile:str = Results File (set as self.info['OutFile'])
    >> readaln:boolean = whether to bother reading Alignments into objects [False] !!! Currently always False !!!
HMMRun.readHMMSearch(self,resfile=None,readaln=False)
    Reads HMM Search Results into objects.
    >> resfile:str = Results File (set as self.info['OutFile'])
    >> readaln:boolean = whether to bother reading Alignments into objects [False] (!!!currently always True!!!)
HMMRun.readResults(self,clear=True,readaln=False)
    Reads results from self.list['HMMRes'] into objects.
    >> clear:boolean = whether to clear self.search before reading [True]
    >> readaln:boolean = whether to bother reading Alignments into objects [False]
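
HMMRun.hmmTable() writes search results to a delimited text file. The actual column layout is not documented here, so the following Python 3 sketch is only an illustration of the general idea using the standard csv module. The function name write_hit_table and the 'HMM' and 'Hit' column names are assumptions; BitScore, E-Value and Length are taken from the HMMHit statistics listed above.

```python
import csv
import io


def write_hit_table(rows, handle, delimit='\t'):
    """Write one header line and one line per hit to an open text handle."""
    fields = ['HMM', 'Hit', 'BitScore', 'E-Value', 'Length']  # hypothetical layout
    writer = csv.writer(handle, delimiter=delimit, lineterminator='\n')
    writer.writerow(fields)
    for row in rows:
        writer.writerow([row[field] for field in fields])


# Made-up example data, for illustration only
rows = [{'HMM': 'kinase.hmm', 'Hit': 'seq1', 'BitScore': 120.5,
         'E-Value': 1e-30, 'Length': 250}]

buffer = io.StringIO()          # stands in for an output file
write_hit_table(rows, buffer)
table_text = buffer.getvalue()
print(table_text)
```

Passing an open file instead of the StringIO buffer would write the same table to disk.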

HMMSearch Class

    HMM Search Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Name of HMM Search File
    - DBase = Name of database File Searched
    
    Opt:boolean

    Stat:numeric
    - DBNum = Number of sequences in database searched

    Obj:RJE_Objects

    Other:
    hit : list of HMMHit Objects
HMMSearch._addHit(self)
    Adds and returns a new hit object.
HMMSearch._setAttributes(self)
    Sets Attributes of Object:
    - Info:str ['Name','DBase']
    - Stats:float ['DBNum']
    - Opt:boolean []
    - Obj:RJE_Object []
HMMSearch.hitNum(self)




rje_hmm Module Methods

rje_hmm.runMain()




rje_hmm Module ToDo Wishlist

    # [y] Make HMM from alignment
    # [y] Perform HMM search
    # [y] Load in Results to Objects
    # - [Y] Query HMMSearch Objects
    # - [y] Query-Hit HMMHit Objects
    # - [y] Query-Hit PWAln Objects
    # [ ] : Add stats and more HMM options
    # [ ] : Tidy up classes in line with other RJE modules
    # [ ] : Add a cleanres=T option to reduce size of HMM results file by removing no-hit sequences.

RJE_HTML [version 0.0] Module for generating HTML ~ [Top]

Module: RJE_HTML
Description: Module for generating HTML
Version: 0.0
Last Edit: 14/03/10
Imports: rje, rje_zen
Imported By: happi, rje_pydocs, rje_tree
Copyright © 2010 Richard J. Edwards - See source code for GNU License Notice

Function:
This module is primarily for general methods for making HTML pages for other modules.

Commandline:
* stylesheets=LIST : List of CSS files to use ['../example.css','../redwards.css']
* tabber=FILE : Tabber javascript file location ['../javascript/tabber.js']
* border=X : Border setting for tables [0]

See also rje.py generic commandline options.

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje, rje_db, rje_slim, rje_uniprot, rje_zen
Other modules needed: None

rje_html Module Version History

    # 0.0 - Initial Compilation.

HTML Class

    HTML Class. Author: Rich Edwards (2010).

    Info:str
    - Tabber = Tabber javascript file location ['../javascript/tabber.js']
    
    Opt:boolean

    Stat:numeric
    - Border = Border setting for tables [0]

    List:list
    - StyleSheets = List of CSS files to use ['../example.css','../redwards.css']

    Dict:dictionary    

    Obj:RJE_Objects
HTML.HTMLcmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
HTML.HTMLdefaults(self)


HTML._cmdList(self)


HTML._setAttributes(self)
    Sets Attributes of Object.
HTML.tabberHTML(self,id,tablist,level=0)
    Returns text for Tabber HTML.

rje_html Module Methods

rje_html.checkHTML(hpage)
    Checks for existence of complete HTML page.
rje_html.domainLink(domain,frontpage=False)
    Returns domain link text.
rje_html.geneLink(gene,frontpage=False)
    Returns gene link text.
rje_html.htmlHead(title,stylesheets=['../example.css','../redwards.css'],tabber=True,frontpage=False,nobots=True,keywords=[],javascript='../javascript/')
    Returns text for top of HTML file.
    >> title:str = Title of webpage.
    >> stylesheets:list = List of stylesheets to use.
    >> tabber:bool [True] = whether page has tabber tabs
    >> frontpage:bool [False] = whether to replace all '../' links with './'
    >> nobots:bool [True] = whether to screen page from bot discovery
    >> keywords:list [] = List of keywords for header
    >> javascript:str ['../javascript/'] = Path to javascript files for tabber tabs
rje_html.htmlTail(copyright='RJ Edwards 2012',tabber=True)
    Returns text for bottom of HTML.
    >> copyright:str = copyright text
    >> tabber:bool = whether page has tabber tabs
rje_html.runMain()


rje_html.seqDetailsHTML(callobj,gene,dbxref)
    Returns HTML text for seq details table.
    >> gene:str = Gene symbol
    >> seqid:str = Sequence Identifier
    >> dbxref:dict = Dictionary of {db:id} for GeneCards, EBI, EnsEMBL, HPRD, OMIM
    >> desc:str = Sequence description
    >> godata:dict = {CC/BP/MF:[(id,name)] list}
rje_html.slimLink(pattern,frontpage=False)
    Returns SLiM link text.
rje_html.stripTags(html,keeptags=[])
    Strips all HTML tag text from html code, except listed keeptags.
rje_html.tabberHTML(id,tablist,level=0)
    Returns text for Tabber HTML.
    >> id:str = Identifier for Tabber object
    >> tablist:list = List of (tab_title, tab_html_text) tuples
    >> level:int = Level of Tabber object (base = level)
rje_html.tabberTabHTML(id,text,title='')
    Returns text for TabberTab HTML.
    >> title:str = Text for title of TabberTab
    >> text:str = HTML text for TabberTab content
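
As an illustration of what rje_html.stripTags() is described as doing, the Python 3 sketch below removes HTML tags with a regular expression while keeping any tags named in keeptags. It is not the module's own code: the name strip_tags and the regular expression are assumptions, and the pattern does not cope with comments or with '>' characters inside attribute values.

```python
import re

# Matches an opening or closing tag and captures the tag name
TAG_RE = re.compile(r'</?([A-Za-z][A-Za-z0-9]*)[^>]*>')


def strip_tags(html, keeptags=()):
    """Remove all tags from html except those whose name is in keeptags."""
    keep = {tag.lower() for tag in keeptags}

    def replace(match):
        # Keep the whole tag text if its name is listed, otherwise drop it
        return match.group(0) if match.group(1).lower() in keep else ''

    return TAG_RE.sub(replace, html)


page = '<p>See <b>RGD</b> in <a href="x.html">ITGB1</a></p>'
plain = strip_tags(page)
bold_only = strip_tags(page, ['b'])
print(plain)
print(bold_only)
```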

rje_html Module ToDo Wishlist

    # [ ] : List here

rje_menu [version 0.2] Generic Menu Methods Module ~ [Top]

Module: rje_menu
Description: Generic Menu Methods Module
Version: 0.2
Last Edit: 20/08/09
Imports: rje
Imported By: budapest, comparimotif_V3, gfessa, seqmapper, wormpump
Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

Function:
This module is designed to contain generic menu methods for use with any RJE Object. At least, that's the plan...

Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module.

Uses general modules: os, string, sys
Uses RJE modules: rje
Other modules needed: None

rje_menu Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Added infile and outfile
    # 0.2 - Added extra addcommand and printtext menu options.

rje_menu Module Methods

rje_menu.menu(callobj,headtext='',menulist=[],choicetext='Please select:',changecase=True,default='')
    Main Menu method.
    >> callobj:Object for which attributes are to be read and altered. Also controls interactivity and log.
    >> headtext:str [''] = Introductory text for menu system.
    >> menulist:list [] = List of menu item tuples (edit code,description,optiontype,optionkey)
        - e.g. ('0','Sequence file','info','Name') would edit callobj.info['Name'])
        - If optiontype == 'return' then menu will return the value given in optionkey
        - If optiontype == '' then description will be printed as a breaker
        - If optiontype == 'infile' then callobj.info['Name'] would be changed using rje.getFileName(mustexist=True)
        - If optiontype == 'outfile' then callobj.info['Name'] would be changed using rje.getFileName(confirm=True)
        - If optiontype == 'showtext' then optionkey should contain text to be printed with verbose
        - If optiontype == 'addcmd' then commands can be added.
    >> choicetext:str ['Please select:'] = Text to display for choice option
    >> changecase:boolean [True] = change all choices and codes to upper case
    << returns optionkey if appropriate, else True
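
The menulist tuples described above can be pictured with a small, non-interactive Python 3 sketch. It only shows how a list of (edit code, description, optiontype, optionkey) tuples might be displayed and how a typed choice might be matched. It is not rje_menu.menu() itself: the helper names render_menu and choose and the '<S>' display style are assumptions.

```python
def render_menu(headtext, menulist, choicetext='Please select:'):
    """Return the menu as text, one line per menu item."""
    lines = [headtext] if headtext else []
    for code, description, optiontype, optionkey in menulist:
        if optiontype == '':
            lines.append(description)                   # breaker line
        else:
            lines.append('<%s> %s' % (code, description))
    lines.append(choicetext)
    return '\n'.join(lines)


def choose(menulist, choice, changecase=True):
    """Return (optiontype, optionkey) for a typed choice, or None if no match."""
    if changecase:
        choice = choice.upper()
    for code, description, optiontype, optionkey in menulist:
        if optiontype == '':
            continue                                    # breakers cannot be chosen
        if (code.upper() if changecase else code) == choice:
            return optiontype, optionkey
    return None


menulist = [('', '# Input #', '', ''),
            ('S', 'Sequence file', 'infile', 'Name'),
            ('Q', 'Quit', 'return', 'quit')]

menu_text = render_menu('Main Menu', menulist)
print(menu_text)
print(choose(menulist, 's'))
```

With changecase=True the lower-case choice 's' matches the 'S' entry, which mirrors the changecase behaviour described for the real method.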

rje_menu Module ToDo Wishlist

    # [ ] : List here

rje_motif [version 3.0] Motif Class and Methods Module ~ (rje_motif_V3.py)[Top]

Module: rje_motif
Description: Motif Class and Methods Module
Version: 3.0
Last Edit: 15/02/07
Imports: rje
Imported By: presto_V5, qslimfinder, slimfinder, rje_motif_stats, rje_motiflist, rje_slimcore, rje_slimlist
Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

Function:
This module contains the Motif class for use with both Slim Pickings and PRESTO, and associated methods. This basic
Motif class stores its pattern in several forms:
- info['Sequence'] stores the original pattern given to the Motif object
- list['PRESTO'] stores the pattern in a list of PRESTO format elements, where each element is a discrete part of
the motif pattern
- list['Variants'] stores simple strings of all the basic variants - length and ambiguity - for identifying the "best"
variant for any given match
- dict['Search'] stores the actual regular expression variants used for searching, which has a separate entry for
each length variant - otherwise Python RegExp gets confused! Keys for this dictionary relate to the number of
mismatches allowed in each variant.

The Motif Class is designed for use with the MotifList class. When a motif is added to a MotifList object, the
Motif.format() command is called, which generates the 'PRESTO' list. After this - assuming it is to be kept -
Motif.makeVariants() makes the 'Variants' list. If creating a motif object in another module, these methods should be
called before any sequence searching is performed. If mismatches are being used, the Motif.misMatches() method must
also be called.
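
The idea of keeping one regular expression per length variant can be sketched in Python 3 as follows. This is an illustration only, not the code of Motif.makeVariants(): the function name length_variants is hypothetical, and it assumes a pattern written as '-'-separated elements in which a wildcard is 'X' and a variable-length element carries a {m,n} suffix.

```python
import itertools
import re

# Splits an element into its unit and an optional {m,n} length range
ELEMENT_RE = re.compile(r'^(.+?)(?:\{(\d+),(\d+)\})?$')


def length_variants(presto):
    """Expand a '-'-separated pattern into one regex string per length variant."""
    choices = []
    for element in presto.split('-'):
        unit, low, high = ELEMENT_RE.match(element).groups()
        if unit == 'X':
            unit = '.'                       # wildcard becomes regex '.'
        if low is None:
            choices.append([unit])           # fixed-length element
        else:
            choices.append([unit * n for n in range(int(low), int(high) + 1)])
    # Every combination of element lengths gives one fixed-length variant
    return [''.join(combo) for combo in itertools.product(*choices)]


variants = length_variants('R-X{1,2}-L')
print(variants)
```

Because each variant has a fixed length, a match can be traced back to exactly one variant, which is the property the description above relies on.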

Commandline:
These options should be listed in the docstring of the module using the motif class:
- * alphabet=LIST : List of letters in alphabet of interest [AAs]
- * ambcut=X : Cut-off for max number of choices in ambiguous position to be shown as variant (0=All) [10]
- * trimx=T/F : Trims Xs from the ends of a motif [False]

Uses general modules: copy, math, os, re, string, sys
Uses RJE modules: rje
Other modules needed: None

Motif Class

    Motif Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Name of motif
    - Description = Description of motif
    - Sequence = *Original* pattern given to motif
    
    Opt:boolean
    - TrimX = Trims Xs from the ends of a motif
    - Compare = Compare the motifs from the motifs FILE with the searchdb FILE (or self if None) [False]
    - MatchIC = Use (and output) information content of matched regions to assess motif matches [True]
    - MotifIC = Output Information Content for motifs [False]

    Stat:numeric
    - AmbCut = Cut-off for max number of choices in ambiguous position to be shown as variant [10]
        For mismatches, this is the max number of choices for an ambiguity to be replaced with a mismatch wildcard
    - Length = Maximum length of the motif in terms of non-wildcard positions
    - MinLength = Minimum length of the motif in terms of non-wildcard positions
    - FixLength = Maximum Length of motif in terms of fixed positions
    - FullLength = Maximum length of the motif, including wildcard positions
    - IC = Information Content of motif
    - OccNum = Number of occurrences in search database
    - OccSeq = Number of different sequences in the search database in which it occurs

    List:list
    - Alphabet = List of letters in alphabet of interest
    - PRESTO = Presto format motifs are strings of elements separated by '-', where each element is:
        > a single AA letter
        > a wildcard 'X'
        > a choice of letters in the form [ABC] *** NB. an "except" [^ABC] is converted to inclusive ambiguity ***
        > a choice of combinations in the form (AB|CD)
        > a start ^ or end $ of sequence marker
        > may be combined with variable numbers of positions {m,n}
    - Variants = List of string variants lists, incorporating length variation and different combos.
        => This is primarily used to determine the best "variant" match for an actual match but also as the base for mismatches.

    Dict:dictionary
    - Expect = dictionary of {key:expected number of occurrences}, where key could be a filename or Sequence object etc.
    - ExpectMM = same as Expect but for each number of mismatches {key:{mm:expect}} - PRESTO only.
    - Search = dictionary of {no. mismatches:list of variant regexps to search}

    Obj:RJE_Objects
    - MotifList = "Parent" MotifList object - contains objects of use to Motif without need to duplicate
Motif._calculateLength(self)
    Calculates Length Statistics.
Motif._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Motif._hitStats(self,match,searchvar)
    Calculates best Variant, ID for Match.
    >> match:str = Matched part of sequence
    >> searchvar:str = regular expression variant used in original match
    << histats:dictionary of stats
Motif._msVar(self,varlist)
    Returns MS-altered variant list.
    >> varlist:list of variant regexp sequence strings
Motif._setAttributes(self)
    Sets Attributes of Object.
Motif.expectation(self,aafreq={},aanum=0,seqnum=0)
    Returns a dictionary of {mismatch:expectation}.
    >> aafreq:dictionary of AA frequencies {aa:freq}
    >> aanum:int = sum total of positions in dataset
    >> seqnum:int = number of different sequence fragments searched
    << expdict:dictionary of {mismatch:expectation}
Motif.format(self,msmode=False,reverse=False)
    Generates list['PRESTO'] and list['Variants'] from info['Sequence']. See docstring for details of PRESTO format.
    >> msmode:boolean [False] = whether to interpret motif as MSMS peptide sequencing.
    >> reverse:boolean [False] = whether to reverse sequence
Motif.makeVariants(self,msmode=False,ambvar=True)
    Makes self.list['Variants'] of variants and basic self.dict['Search'] with no mismatches.
    - self.list['Variants'] = with non-regexp variants of different ambiguities etc.
    - self.dict['Search'][0] = regular expressions (length variants) for searching.
    >> msmode:boolean [False] = whether to interpret motifs as MSMS peptides.
    >> ambvar:boolean [True] = whether to make full ambiguity variants for PRESTO search or just dict['Search'][0]
        - this should be set to False for CompariMotif, especially when ambcut is high
Motif.misMatches(self,mismatch={},msmode=False,trimx=False,basevar=False)
    Populates attributes with variants and regular expressions.
    - self.dict['Search'][X] = regular expressions (length variants) for searching with X mismatches.
    >> mismatch:dictionary of {mm 'X':Y aa}
    >> msmode:boolean = whether to interpret motifs as MSMS peptides.
    >> trimx:boolean = whether to trim Xs from ends of motif (for compare)
    >> basevar:boolean = whether to use self.list['Variants'] as bases rather than self.dict['Search'][0]
Motif.nofm(self,element)
    Returns list of variants for "n of m" format.
Motif.patternStats(self)
    Performs calculations based on basic pattern (info['Sequence']), adding to self.stat/info/opt.
Motif.searchSequence(self,sequence,logtext='')
    Searches the given sequence for occurrences of self and returns a list of hit dictionaries: Pos,Variant,Match
    >> sequence:str = sequence to be searched
    >> logtext:str [''] = text to precede progress printing. If '', no progress printing!
    << hitlist:list of dictionaries with hit information: Pos,Variant,Match,ID,MisMatch
Motif.slimCode(self)
    Makes a SLiMFinder slimcode and stores in self.info['Slim']. Returns code or empty string.
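
Motif.searchSequence() is described as returning a list of hit dictionaries. The Python 3 sketch below shows one way such a search over a list of regular expression variants could look. It is an assumption-laden illustration rather than the class's own method: the function name search_sequence, the 1-based 'Pos' value and the reporting of overlapping matches are all choices made for this example, and the ID and MisMatch keys are left out.

```python
import re


def search_sequence(sequence, variants):
    """Return a list of {'Pos', 'Variant', 'Match'} dictionaries for all variants."""
    hits = []
    for variant in variants:
        # The lookahead wrapper allows overlapping occurrences to be found
        for match in re.finditer('(?=(%s))' % variant, sequence):
            hits.append({'Pos': match.start() + 1,     # 1-based (assumption)
                         'Variant': variant,
                         'Match': match.group(1)})
    hits.sort(key=lambda hit: (hit['Pos'], hit['Variant']))
    return hits


hits = search_sequence('MRALRGGLK', ['R.L', 'R..L'])
for hit in hits:
    print(hit)
```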

rje_motif_V3 Module Methods

rje_motif_V3.defineMotif(callobj=None,occlist=[],profile=False,minfreq=0.2,minocc=2,ambcut=19)
    Takes occurrences and makes motif(s) from them.
    >> callobj:Object to handle errors etc.
    >> occlist:list of instances. Can be variants (with wildcards) or not
    >> profile:boolean  = whether to return profile-esque patterns with numbers [False]
    >> minfreq:float = min freq of any aa for position to be non-wildcard [0.2]
    >> minocc:int = min number of any aa for position to be non-wildcard (in addition to minfreq) [2]
    >> ambcut:int = number of ambiguities allowed before position marked as wildcard [19]
    << redefined:str = redefined motif (or csv motifs if lengths cannot be compressed using wildcards)
rje_motif_V3.elementIC(element='',aafreq={},wild_pen=0.0,max_info=0.0,callobj=None)
    Calculates the IC for a given pattern element. See Motif.__doc__ for description of elements:
    >> pattern:str = motif pattern
    >> aafreq:dict = aa frequencies
    >> wild_pen:float = wildcard penalty
rje_motif_V3.expect(pattern,aafreq,aanum,seqnum,binomial=False,adjustlen=True)
    Returns the expected number of occurrences for given pattern. Xs and .s both count as wildcards.
    >> aafreq:dictionary of AA frequencies {aa:freq}
    >> aanum:int = sum total of positions in dataset
    >> seqnum:int = number of different sequence fragments searched
    >> binomial:bool [False] = Whether to return n & p data for binomial rather than expectation for poisson
    >> adjustlen:bool [True] = Whether to adjust no. of sites by length of motif
    << expected number of occurrences *or* (prob_per_site,num_sites) if binomial=True
rje_motif_V3.expectString(_expect)
    Returns formatted string for _expect value.
rje_motif_V3.maxInfo(aafreq)
    Calculates the maximum information content score given aafreq. Note that this is slightly misleading as this is the
    largest *negative* value, which is then subtracted from the IC measure to give the actual IC value. For fixed
    positions, the IC becomes 0.0 - max_info = - max_info
rje_motif_V3.occProb(observed,expected)
    Returns the poisson probability of observed+ occurrences, given expected.
rje_motif_V3.reformatMiniMotif(callobj=None,pattern='')
    Reformats minimotif into standard motif format.
    >> callobj:Object to handle errors etc.
    >> pattern:str = MiniMotif pattern
    << redefined:str = redefined motif 
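
The expect() and occProb() entries above describe an expected number of occurrences and a Poisson probability of seeing at least the observed number. A minimal Python 3 sketch of those two calculations is given below. It is not the module's code: the function names, the simple pattern syntax handled (fixed residues, X or . wildcards and [ABC] choices) and the way adjustlen reduces the number of sites are assumptions.

```python
import math
import re

# One pattern position: either a [ABC] choice or a single letter / '.'
POSITION_RE = re.compile(r'\[([A-Z]+)\]|([A-Z.])')


def expected_occurrences(pattern, aafreq, aanum, seqnum, adjustlen=True):
    """Expected occurrences = probability per site x number of sites."""
    prob = 1.0
    length = 0
    for choice, single in POSITION_RE.findall(pattern):
        length += 1
        if single in ('X', '.'):
            continue                          # wildcard: probability 1
        residues = choice or single
        prob *= sum(aafreq.get(aa, 0.0) for aa in residues)
    sites = aanum
    if adjustlen:
        # Assumption: a motif of length L cannot start in the last L-1
        # positions of each sequence
        sites = max(0, aanum - seqnum * (length - 1))
    return prob * sites


def poisson_at_least(observed, expected):
    """Poisson probability of observed or more occurrences given expected."""
    below = sum(math.exp(-expected) * expected ** k / math.factorial(k)
                for k in range(observed))
    return 1.0 - below


uniform = {aa: 0.05 for aa in 'ACDEFGHIKLMNPQRSTVWY'}   # made-up frequencies
expected = expected_occurrences('R.L', uniform, aanum=1000, seqnum=10)
print(expected, poisson_at_least(5, expected))
```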

rje_motif_stats [version 1.0] Motif Statistics Methods Module ~ [Top]

Module: rje_motif_stats
Description: Motif Statistics Methods Module
Version: 1.0
Last Edit: 01/02/07
Imports: gopher_V2, rje, rje_blast, rje_disorder, rje_motif_V3, rje_seq, rje_sequence
Imported By: presto_V5, rje_motiflist
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module contains the Alignment Conservation methods for motifs, as well as other calculations needing occurrence
data. This module is designed to be used by the MotifList class, which contains the relevant commandline options.

Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module.

Uses general modules: copy, os, string, sys
Uses RJE modules: gopher_V2, rje, rje_blast, rje_disorder, rje_motif_V3, rje_seq, rje_sequence
Other modules needed: rje_seq modules

rje_motif_stats Module Methods

rje_motif_stats.absCons(callobj,Occ,hithom,seqfrag,seqwt)
    Absolute conservation score.
rje_motif_stats.bestScore(aa,aalist,posmatrix)
    >> aa:str = Amino acid to be compared
    >> aalist:str = List of aas to compare to
    >> posmatrix:dict of {'a1a2':score} to get score from
rje_motif_stats.consWeight(callobj,hitcon,seqwt)
    Weights conservation and returns final conservation score.
    >> callobj:Calling object
    >> hitcon: raw dictionary of {seq:conservation}
    >> seqwt: weighting dictionary of {seq:weighting}
    << cons: conservation score
rje_motif_stats.findFudge(qryseq,match,pos)
    >> qryseq: degapped query sequence
    >> match:match sequence 
    >> pos:position match is meant to be from 0 to L
    << fudge:int = amount to move pos to find match in qryseq. 0 = not there!
rje_motif_stats.findOccPos(callobj,Occ,qry,fudge=0)
    Finds Motif Occurrence in alignment.
    >> callobj = calling MotifList object
    >> Occ = MotifOcc object
    >> qry = query Sequence object from alignment file
    >> fudge = amount to try shifting match to find occurrence in non-matching sequence
    << (start,end) = start and end position in alignment to allow sequence[start:end]
rje_motif_stats.hitAlnCon(callobj,occlist,logtext=())
    Looks for alignment and, if appropriate, calculate conservation stats.
    
    Any homologues with masked (X) residues that coincide to non-wildcard positions of the motif occurrence will be
    ignored from conservation calculations. Gaps, however, shall be treated as divergence. The exception is that when the
    alngap=F option is used, 100% gapped regions of homologues are also ignored.

    This method deals with all the occurrences of all motifs for a single sequence and its alignment. Global alignment
    statistics are calculated first, then each occurrence for each motif is processed. Since version 1.1, subtaxa are
    treated the same as all taxa to reduce the coding: the default all taxa is now effectively an additional subtaxa set.
    
    >> callobj:Object containing settings for stats generation (MotifList, generally).
    >> occlist:list of MotifOcc objects to calculate stats for (must all have same Seq)
    >> logtext:tuple of (logleader,logtext) to use as basis for log messages. (None if len != 2)
rje_motif_stats.loadOrthAln(callobj,seq,gopher=True)
    Identifies file, loads and checks alignment. If the identified file is not actually aligned, then RJE_SEQ will try to
    align the proteins using MUSCLE or ClustalW.
    >> callobj:Object containing settings for stats generation (MotifList, generally).
    >> seq:Sequence being analysed.
    >> gopher:bool [True] = whether to try to generate alignment with GOPHER if callobj.opt['Gopher']
    << aln = SeqList object containing alignment with queryseq
rje_motif_stats.occStats(callobj,occlist,logtext=())
    Calculates general occurrence stats for occlist.
    >> callobj:Object containing settings for stats generation (MotifList, generally).
    >> occlist:list of MotifOcc objects to calculate stats for (must all have same Seq)
    >> logtext:tuple of (logleader,logtext) to use as basis for log messages. (None if len != 2)
rje_motif_stats.posCons(callobj,Occ,hithom,red_aln,seqwt,posmatrix)
    Positional conservation score.
rje_motif_stats.seqDom(callobj,seq,seq_dis)
    Returns dictionary of lists of scores for different domain scores.
    >> callobj:Object controlling attributes
    >> Seq:Sequence Object
    >> seq_dis:dictionary of lists of disorder scores
    << seq_dom:dictionary of domain scores
rje_motif_stats.statLog(callobj,logtext,extra)
    Output to log.
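
consWeight() is documented as taking a {seq:conservation} dictionary and a {seq:weighting} dictionary and returning a single conservation score. One plausible reading is a weighted mean, sketched below in Python 3. The exact formula used by the module is not given here, so treat this as an assumption. The function name weighted_conservation is hypothetical.

```python
def weighted_conservation(hitcon, seqwt):
    """Weighted mean of per-sequence conservation scores."""
    total_weight = sum(seqwt[seq] for seq in hitcon)
    if total_weight <= 0:
        return 0.0                      # nothing to weight by
    return sum(hitcon[seq] * seqwt[seq] for seq in hitcon) / total_weight


# Made-up example: the motif is conserved in one homologue but not the other
hitcon = {'mouse': 1.0, 'yeast': 0.0}
equal = weighted_conservation(hitcon, {'mouse': 1.0, 'yeast': 1.0})
favour_close = weighted_conservation(hitcon, {'mouse': 3.0, 'yeast': 1.0})
print(equal, favour_close)
```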

rje_motiflist [version 1.0] RJE Motif List Module ~ [Top]

Module: rje_motiflist
Description: RJE Motif List Module
Version: 1.0
Last Edit: 03/04/07
Imports: rje, rje_aaprop, rje_blast, rje_disorder, rje_motif_stats, rje_scoring, rje_seq, rje_sequence, rje_motif_V3, rje_uniprot, rje_motifocc
Imported By: presto_V5, rje_slimcore
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module contains the MotifList Class, which is designed to replace many of the functions that previously formed
part of the Presto Class. This class will then be used by PRESTO, SLiMPickings and CompariMotif (and others?) to
control Motif loading, redundancy and storage. MotifOcc objects will replace the previous PrestoSeqHit objects and
contain improved data commenting and retrieval methods. The MotifList class will contain methods for filtering motifs
according to individual or combined MotifOcc data.

The options below should be read in by the MotifList object when it is instanced with a cmd_list and therefore do not
need to be part of any class that makes use of this object unless it has conflicting settings.

The Motif Stats options are used by MotifList to calculate statistics for motif occurrences, though this data will
actually be stored in the MotifOcc objects themselves. This includes conservation statistics.

Note. Additional output parameters, such as motifaln and proteinaln settings, and stat filtering/novel scores are not
stored in this object, as they will be largely dependent on the main programs using the class, and the output from
those programs. (This also enables statfilters etc. to be used with stats not related to motifs and their occurrences
if desired.)

MotifList Commands:
## Basic Motif Input/Formatting Parameters ##
* motifs=FILE : File of input motifs/peptides [None]
Single line per motif format = 'Name Sequence #Comments' (Comments are optional and ignored)
Alternative formats include fasta, SLiMDisc output and raw motif lists.
* minpep=X : Min length of motif/peptide X aa [2]
* minfix=X : Min number of fixed positions for a motif to contain [0]
* minic=X : Min information content for a motif (1 fixed position = 1.0) [2.0]
* trimx=T/F : Trims Xs from the ends of a motif [False]
* nrmotif=T/F : Whether to remove redundancy in input motifs [False]
* minimotif=T/F : Input file is in minimotif format and will be reformatted (PRESTO File format only) [False]
* goodmotif=LIST : List of text to match in Motif names to keep (can have wildcards) []
* ambcut=X : Cut-off for max number of choices in ambiguous position to be shown as variant [10]
* reverse=T/F : Reverse the motifs - good for generating a test comparison data set [False]
* msms=T/F : Whether to include MSMS ambiguities when formatting motifs [False]
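
The single-line motif format 'Name Sequence #Comments' given for the motifs=FILE option can be read with a few lines of Python 3. This sketch is not the MotifList loader. The function name parse_motif_line is hypothetical, and treating a line with a single field as a raw motif whose name is its own pattern is an assumption.

```python
def parse_motif_line(line):
    """Return (name, pattern) for one motif line, or None for blank/comment lines."""
    line = line.split('#', 1)[0].strip()     # comments are optional and ignored
    if not line:
        return None
    fields = line.split()
    if len(fields) == 1:
        # Assumption: a raw motif with no name is named after its pattern
        return fields[0], fields[0]
    return fields[0], fields[1]


named = parse_motif_line('LIG_RGD RGD # integrin-binding example')
raw = parse_motif_line('R.L')
print(named, raw)
```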

## Motif Occurrence Statistics Options ##
* winsa=X : Number of aa to extend Surface Accessibility calculation either side of motif [0]
* winhyd=X : Number of aa to extend Eisenberg Hydrophobicity calculation either side of motif [0]
* windis=X : Extend disorder statistic X aa either side of motif (use flanks *only* if negative) [0]
* winchg=X : Extend charge calculations (if any) to X aa either side of motif [0]
* winsize=X : Sets all of the above window sizes (use flanks *only* if negative) [0]
* slimchg=T/F : Calculate Absolute, Net and Balance charge statistics (above) for occurrences [False]
* iupred=T/F : Run IUPred disorder prediction [False]
* foldindex=T/F : Run FoldIndex disorder prediction [False]
* iucut=X : Cut-off for IUPred results (0.0 will report mean IUPred score) [0.0]
* iumethod=X : IUPred method to use (long/short) [short]
* domfilter=FILE : Use the DomFilter options, reading domains from FILE [None] ?? Check how this works ??
* ftout=T/F : Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLIMs [*.features.tdt]
* percentile=X : Percentile steps to return in addition to mean [0]
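
The window options above extend a calculation X aa either side of a motif, or use the flanks only when X is negative. A short Python 3 sketch of that rule follows. The function name window_region and the 0-based, end-exclusive coordinates are assumptions made for the example. The module's own coordinate handling is not documented here.

```python
def window_region(sequence, start, end, win):
    """Return the region a statistic would cover for a motif at sequence[start:end].

    win >= 0 : motif plus win residues either side
    win < 0  : the two flanks of abs(win) residues only, without the motif
    """
    if win >= 0:
        return sequence[max(0, start - win):end + win]
    flank = -win
    return sequence[max(0, start - flank):start] + sequence[end:end + flank]


seq = 'ABCDEFGHIJ'                       # toy sequence; the motif is 'EF'
print(window_region(seq, 4, 6, 0))
print(window_region(seq, 4, 6, 2))
print(window_region(seq, 4, 6, -2))
```

Windows that would run past either end of the sequence are simply truncated in this sketch.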

## Conservation Parameters ## ??? Add separate SlimCons option ???
* usealn=T/F : Whether to search for and use alignments where present. [False]
* gopher=T/F : Use GOPHER to generate missing orthologue alignments in alndir - see gopher.py options [False]
* alndir=PATH : Path to alignments of proteins containing motifs [./] * Use forward slashes (/)
* alnext=X : File extension of alignment files, accnum.X [aln.fas]
* alngap=T/F : Whether to count proteins in alignments that have 100% gaps over motif (True) or (False) ignore
as putative sequence fragments [False] (NB. All X regions are ignored as sequence errors.)
* conspec=LIST : List of species codes for conservation analysis. Can be name of file containing list. [None]
* conscore=X : Type of conservation score used: [pos]
- abs = absolute conservation of motif using RegExp over matched region
- pos = positional conservation: each position treated independently
- prop = conservation of amino acid properties
- all = all three methods for comparison purposes
* consamb=T/F : Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
* consinfo=T/F : Weight positions by information content (does nothing for conscore=abs) [True]
* consweight=X : Weight given to global percentage identity for conservation, giving more weight to closer sequences [0]
- 0 gives equal weighting to all. Negative values will upweight distant sequences.
* posmatrix=FILE : Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) [None]
* aaprop=FILE : Amino Acid property matrix file. [aaprop.txt]
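
The consweight description (0 gives equal weighting, positive values favour closer sequences, negative values favour distant ones) is consistent with raising the global identity to the power of consweight. The Python 3 sketch below uses that form purely as an assumption. The formula actually used is not stated in this document, and the function name sequence_weights is hypothetical.

```python
def sequence_weights(identity, consweight=0.0):
    """Return {seq: weight} from {seq: global identity as a fraction 0-1}."""
    weights = {}
    for seq, pid in identity.items():
        # Zero identity cannot be raised to a negative power, so it gets no weight
        weights[seq] = pid ** consweight if pid > 0 else 0.0
    return weights


identity = {'near': 0.9, 'far': 0.5}     # made-up identities to the query
print(sequence_weights(identity, 0))
print(sequence_weights(identity, 2))
print(sequence_weights(identity, -1))
```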

## Alignment Settings ##
* protalndir=PATH : Output path for Protein Alignments [ProteinAln/]
* motalndir=PATH : Output path for Motif Alignments []
* flanksize=X : Size of sequence flanks for motifs [30]
* xdivide=X : Size of dividing Xs between motifs [10]

## System Settings ##
* iupath=PATH : The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
?? * memsaver=T/F : Whether to store all results in Objects (False) or clear as search proceeds (True) [True] ??
?- should this be controlled purely by the calling program? Probably!
* fullforce=T/F : Whether to force regeneration of alignments using GOPHER

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje, rje_aaprop, rje_disorder, rje_motif_V3, rje_motif_cons, rje_scoring, rje_seq, rje_sequence,
rje_blast, rje_uniprot
Other modules needed: rje_dismatrix

MotifList Class

    Motif List Class for use with PRESTO, SLiMPickings etc. Author: Rich Edwards (2007). Based on PRESTO 4.2.

    Info:str
    - Name = Name of input motif/peptide file
    - MotifOut = Filename for output of reformatted (and filtered?) motifs in PRESTO format [None]
    - DomFilter = Use the DomFilter options, reading domains from FILE [None]
    - AlnDir = Path to alignment files
    - AlnExt = File extensions of alignments: AccNum.X
    - ConScore = Type of conservation score used:  [abs]
        - abs = absolute conservation of motif: reports percentage of homologues in which conserved
        - prop = conservation of amino acid properties
        - pos = positional conservation: each position treated independently 
        - all = all three methods for comparison purposes
    - PosMatrix = Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix)
    - ProtAlnDir = Directory name for output of protein alignments [ProteinAln/]
    - MotAlnDir = Directory name for output of motif alignments []
    
    Opt:boolean
    - NRMotif = Whether to remove redundancy in input motifs [False]
    - Expect = Whether to calculate crude 'expected' values based on AA composition.
    - MSMS = Whether to run in MSMS mode
    - UseAln = Whether to look for conservation in alignments
    - Reverse = Reverse the motifs - good for generating a test comparison data set [False]
    - IUPred = Run IUPred disorder prediction [False]
    - FoldIndex = Run FoldIndex disorder prediction [False]
    - ConsInfo = Weight positions by information content [True]
    - AlnGap = Whether to count proteins in alignments that have 100% gaps over motif (True) or ignore them as putative sequence fragments (False) [True]
    - Gopher = Use GOPHER to generate missing orthologue alignments in outdir/Gopher - see gopher.py options [False]
    - ConsAmb = Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
    - TrimX = Trims Xs from the ends of a motif
    - SlimChg = Calculate Absolute, Net and Balance charge statistics (above) for occurrences [False]
    - MiniMotif = Input file is in minimotif format and will be reformatted [False]
    - Compare = whether being called by CompariMotif (has some special requirements!)
    - FullForce = Whether to force regeneration of alignments using GOPHER
    
    Stat:numeric
    - AmbCut = Cut-off for max number of choices in ambiguous position to be shown as variant [10]
        For mismatches, this is the max number of choices for an ambiguity to be replaced with a mismatch wildcard
    - MinPep = Minimum length of motif/peptide (non-X characters)
    - MinFix = Min number of fixed positions for a motif to contain [0]
    - MinIC = Min information content for a motif (1 fixed position = 1.0) [2.0]
    - WinSA = Number of aa to extend Surface Accessibility calculation either side of motif [0]
    - WinHyd = Number of aa to extend Eisenberg Hydrophobicity calculation either side of motif [0]
    - WinDis = Extend disorder statistic X aa either side of motif [0]
    - WinChg = Extend charge calculations (if any) to X aa either side of motif [0]
    - WinSize = Used for peptide design and also to set all of above [0]
    - ConsWeight = Weight given to global percentage identity for conservation, given more weight to closer sequences [0]
    - FlankSize = Size of sequence flanks for motifs in MotifAln [30]
    - XDivide = Size of dividing Xs between motifs [10]
    - FTOut = Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [*.features.tdt]
    - Percentile = Percentile steps to return in addition to mean [25]

    List:list
    - Alphabet = List of letters in alphabet of interest
    - Motifs = List of rje_Motif.Motif objects
    - MotifOcc = List of MotifOcc objects 
    - GoodMotif = List of text to match in Motif names to keep (can have wildcards) []

    Dict:dictionary
    - ConsSpecLists = Dictionary of {BaseName:List} lists of species codes for special conservation analyses
    - ElementIC = Dictionary of {Position Element:IC}
    - PosMatrix = Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) {}

    Obj:RJE_Objects
    - AAPropMatrix = rje_aaprop.AAPropMatrix object
    - SeqList = SeqList object of search database
MotifList.FTOut(self,acc_occ)
    Produces a single file for the dataset of extracted UniProt features with the motif positions added.
    >> acc_occ:dictionary of {AccNum:{Motif:[(position:match)]}}
MotifList.OLDstatFilter(self,datadict)
    Returns True if motif should be kept according to self.obj['Presto'].list['StatFilter']. 
    >> datadict:dictionary of hit stats
    << True/False if accepted/filtered
MotifList._addMotif(self,name,seq,reverse=False,check=False,logrem=True)
    Adds new motif to self.list['Motifs']. Checks redundancy etc.
    >> name:str = Motif Name
    >> seq:str = Motif Sequence read from file
    >> reverse:boolean [False] = whether to reverse sequence
    >> check:boolean [False] = whether to check redundancy and sequence length
    >> logrem:boolean [True] = whether to log removal of motifs
    << returns Motif object or None if failed
MotifList._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
MotifList._reformatMiniMotif(self,mlines)
    Reformats MiniMotif file, compressing motifs as appropriate.
    >> mlines:list of lines read from input file
MotifList._setAttributes(self)
    Sets Attributes of Object.
MotifList.addOcc(self,Seq=None,Motif=None,data={},merge=False)
    Adds a MotifOcc object with the data given.
    >> Seq:Sequence object in which the occurrence lies
    >> Motif:Motif object
    >> data:dictionary of data to add {'Info':{},'Stat':{},'Data':{}}
    >> merge:bool [False] = whether to check for existing occurrence and merge if found
MotifList.checkForOcc(self,Occ,merge=True)
    Returns existing MotifOcc if one is found, else given Occ.
    >> Occ:MotifOcc object to check
    >> merge:bool [True] = whether to merge Occ with found occurrence, if there is one
MotifList.clear(self)
    Clears objects and lists.
MotifList.combMotifOccStats(self,statlist=[],revlist=[],log=True,motiflist=[])
    Combines mean and percentile stats for the Occurrences of a Motif.
    >> statlist:list of stats to combine from occurrences
    >> revlist:list of stats that should be ordered from low(best) to high(worst) rather than the other way round
    >> log:boolean [True] = whether to log progress of stat combination
MotifList.loadMotifs(self,file=None,clear=True)
    Loads motifs and populates self.list['Motifs'].
    >> file:str = filename or self.info['Name'] if None
    >> clear:boolean = whether to clear self.list['Motifs'] before loading [True]
MotifList.mapMotif(self,Occ,update=True)
    Returns Motif Object and updates Occ, based on self.motifs() and Occ data.
    >> Occ:MotifOcc object to check
    >> update:bool [True] = whether to update own list['Motifs'] and/or Motif objects
MotifList.mapPattern(self,pattern,update=True)
    Returns Motif Object based on self.motifs(). Will add if update=True and pattern missing.
    >> pattern:str = motif pattern to check
    >> update:bool [True] = whether to update own list['Motifs'] and/or Motif objects
MotifList.mergeOcc(self,occlist,overwrite=False)
    Adds Info, Stat and Data from occlist[1:] to occlist[0]. If overwrite=False, then the occurrences should be in
    order of preferred data, as only missing values will be added. If overwrite=True, then later MotifOcc in the list
    will overwrite the values of the earlier ones. Note that it is always the first MotifOcc object that is changed.
    Note also that no checks are made that the objects *should* be merged. (See self.checkForOcc().)
    >> occlist:list of MotifOcc to merge. Will add to occlist[0] from occlist[1:]
    >> overwrite:bool [False] = If False, will only add missing Info/Stat/Data entries. If True, will add all.
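
The merge rules described for mergeOcc can be sketched as follows. This is a minimal illustration that uses plain dictionaries as stand-ins for MotifOcc objects. The function name merge_occ and the example values are hypothetical.

```python
def merge_occ(occlist, overwrite=False):
    """Merge Info/Stat/Data dictionaries from occlist[1:] into occlist[0].

    overwrite=False: only keys missing from the first occurrence are added.
    overwrite=True: later occurrences replace earlier values.
    """
    target = occlist[0]
    for occ in occlist[1:]:
        for section in ('Info', 'Stat', 'Data'):
            merged = target.setdefault(section, {})
            for key, value in occ.get(section, {}).items():
                if overwrite or key not in merged:
                    merged[key] = value
    return target   # always the first occurrence, modified in place


first = {'Info': {'Match': 'RGD'}, 'Stat': {'Pos': 10}, 'Data': {}}
second = {'Info': {'Match': 'KGD', 'Variant': 'RGD'},
          'Stat': {'Pos': 12, 'Cons': 0.8}, 'Data': {}}
merged = merge_occ([first, second])
```

Because overwrite is False here, the first occurrence keeps its own Match and Pos values and gains only the Variant and Cons entries it was missing.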
MotifList.motifAlignments(self,resfile='motifaln.fas')
    Makes motif alignments from occurrences. MotifOcc objects should have Sequence objects associated with them. If
    necessary, add a method to go through and generate Sequence objects using MotifOcc.info['FastaCmd'].
    >> resfile:str = Name of output file
MotifList.motifAlnLong(self,Motif,seq_occ,append=False,memsaver=False,resfile='')
    Makes a single MotifAln output.
    >> Motif:Motif Object
    >> seq_occ:dictionary of {Seq:[MotifOcc]}
    >> append:bool [False] = whether to append file or create new
    >> memsaver:bool [False] = whether output is to be treated as a single sequence or all occurrences
    >> resfile:str [''] = name for motif alignment output file
MotifList.motifInfo(self,expfile=None,statlist=[])
    Produces summary table for motifs, including expected values and information content.
    >> expfile:str = Filename from which expectations have been calculated and should be output
    >> statlist:list of extra statistics to output
MotifList.motifNum(self)


MotifList.motifOcc(self,byseq=False,justdata=None,fastacmd=False,nested=True)
    Returns a list or dictionary of MotifOccurrences.
    >> byseq:bool [False] = return a dictionary of {Sequence:{Motif:OccList}}, else {Motif:{Sequence:OccList}} 
    >> justdata:str [None] = if given a value, will return this data entry for each occ rather than the object itself
    >> fastacmd:bool [False] = whether to return FastaCmd instead of Sequence if Sequence missing
    >> nested:bool [True] = whether to return a nested dictionary or just the occurrences per Motif/Seq
    << returns dictionary or plain list of MotifOcc if byseq and bymotif are both False    
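
The nested return structures described for motifOcc can be sketched with plain tuples. The function group_occurrences and the (motif, sequence, occurrence) tuples are hypothetical stand-ins, and the reading of nested=False as a single-level dictionary is an assumption.

```python
def group_occurrences(occs, byseq=False, nested=True):
    """Group (motif, sequence, occurrence) triples into dictionaries.

    byseq=False  -> {Motif: {Sequence: [occ, ...]}}
    byseq=True   -> {Sequence: {Motif: [occ, ...]}}
    nested=False -> {Motif: [occ, ...]} or {Sequence: [occ, ...]}
    """
    grouped = {}
    for motif, seq, occ in occs:
        outer, inner = (seq, motif) if byseq else (motif, seq)
        if nested:
            grouped.setdefault(outer, {}).setdefault(inner, []).append(occ)
        else:
            grouped.setdefault(outer, []).append(occ)
    return grouped


occs = [('RGD', 'P12345', 10), ('RGD', 'P12345', 55),
        ('KEN', 'P12345', 80), ('RGD', 'Q99999', 3)]
by_motif = group_occurrences(occs)
by_seq = group_occurrences(occs, byseq=True)
flat = group_occurrences(occs, nested=False)
```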
MotifList.motifOut(self,filename='None',motlist=[])
    Outputs motifs in PRESTO format. #!# Check exactly how different formats are stored etc. #!#
    >> filename:str [None] = Name for output file. Will not output if '' or 'None'.
    >> motlist:list of Motif objects to output. If [], will use self.list['Motifs']
MotifList.motifs(self)


MotifList.outputs(self)
    Processes additional outputs for motif occurrences (Alignments and Features).
MotifList.patternStats(self,log=False)
    Performs calculations for all motifs based on basic pattern (info['Sequence']), adding to Motif.stat/info/opt.
MotifList.posMatrix(self)
    Loads and builds PosMatrix for Conservation Scoring.
MotifList.proteinAlignments(self,alndir='',hitname='AccNum')
    Generates copies of protein alignments, with motif hits marked.
MotifList.rankMotifs(self,stat,cutoff=0)
    Reranks Motifs using stat and reduces to cutoff if given.
    >> stat:str = Stat to use for ranking
    >> cutoff:int [0] = number of top ranks to keep
MotifList.removeMotif(self,Motif,remtxt='')
    Removes motif and occurrences from self.
    >> Motif:Motif object to remove
    >> remtxt:str = Text to output to log
MotifList.seqExp(self,seq=None)
    Populates self.dict[Expect].
    >> seq:Sequence object to consider
MotifList.seqListExp(self,seqlist=None,filename='',cutoff=0)
    Populates self.dict[Expect].
    >> seqlist:SeqList object to consider
    >> filename:Sequence files to consider (must be fasta)
    >> cutoff:float [0] = Expectation cutoff, will remove motif if exceeded.
MotifList.setupDomFilter(self)
    Sets up self.dict['DomFilter'].
MotifList.setupGopher(self)
    Sets up GOPHER directory etc.
MotifList.singleProteinAlignment(self,seq,occlist,alndir='',hitname='AccNum',gopher=True,savefasta=True)
    Generates copies of protein alignment, with motif hits marked. Return SeqList object.
    >> seq:Sequence object for alignment
    >> occlist:list of MotifOcc objects for sequence
    >> alndir:str = Alignment directory for output
    >> hitname:str = Format of hitnames, used for naming files
    >> gopher:boolean = whether to look to use Gopher if settings correct (set this False if done already)
    >> savefasta:boolean [True] = whether to save fasta file or simply return SeqList object alone.
MotifList.statFilter(self,occdata={},statfilter={})
    Filters using rje_scoring and removes filtered MotifOcc.
    >> occdata:dictionary of {Occ:Statdict}
    >> statfilter:dictionary of statfilters
    << occdata: returns reduced occdata dictionary (same object - pre-reassign if original required!)
MotifList.statFilterMotifs(self,statfilter)
    Filters motifs using statfilter.

rje_motiflist Module Methods

rje_motiflist.runMain()




rje_motifocc [version 0.0] Motif Occurrence Module ~ [Top]

Module: rje_motifocc
Description: Motif Occurrence Module
Version: 0.0
Last Edit: 29/01/07
Imports: rje, rje_seq, rje_sequence
Imported By: rje_motiflist
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module contains the MotifOcc class. This class is for storing methods and attributes pertinent to an individual
occurrence of a motif, i.e. one Motif instance in one sequence at one position. This class is loosely based on (and
should replace) the old PRESTO PrestoHit object. (And, to some extent, the PrestoSeqHit object.) This class is
designed to be flexible for use with PRESTO, SLiMPickings and CompariMotif, among others.

In addition to storing the standard info and stat dictionaries, this object will store a "Data" dictionary, which
contains the (program-specific) data to be output for a given motif. All data will be in string format. The
getData() and getStat() methods will automatically convert from string to numerics as needed.
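
The string-to-numeric conversion and default handling described above might look like the following sketch. The function get_stat is a hypothetical stand-in and does not reproduce the actual getStat() or getData() signatures.

```python
def get_stat(data, key, default=0.0):
    """Return data[key] converted from string to int or float.

    Returns default if the key is missing or the value is not numeric.
    """
    if key not in data:
        return default
    value = data[key]
    for convert in (int, float):
        try:
            return convert(value)
        except (TypeError, ValueError):
            continue
    return default


# All values are stored as strings, as in the "Data" dictionary.
occ_data = {'Pos': '42', 'Cons': '0.875', 'Match': 'RGD'}
pos = get_stat(occ_data, 'Pos')
cons = get_stat(occ_data, 'Cons')
hom = get_stat(occ_data, 'Hom')
```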

Commandline:
This module has no standalone functionality and should not be called from the commandline.

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje, rje_seq, rje_sequence
Other modules needed: None

MotifOcc Class

    Motif Occurrence Class. Author: Rich Edwards (2007).

    Note that, unlike other RJE_Object classes, this list of attributes will not be created for all runs with default
    values. Instead, defaults are returned for missing values using the getStat() and getData() methods.

    The list of attributes here is for reference when using other programs. Please try to be consistent across calling
    programs (e.g. PRESTO and SLiM Pickings) for management issues but this is not enforced anywhere. (Some conservation
    and disorder etc. functions will, however, need and return set attributes.)

    Note also that some of the original attributes of the PrestoHit object are no longer stored as they can be
    retrieved from the Motif and Sequence object associated with the MotifOcc.

    Info:str
    - Name   : The name of the regular expression given in the input file.
    - SearchVar : The actual RegExp making the match
    - Variant: The sequence of the regular expression variant that is closest to the match.
    - Match  : The sequence of the matched region in the target sequence.
    - Expect : A crude measure of how often this MATCH would be expected by chance in this database. Formatted.
    - PepSeq : Peptide sequence if presto.opt['Peptides'] = True
    
    Stat:numeric
    - ID    : The percentage identity between VARIANT and MATCH. (Always 1 unless mismatches allowed!)
    - Cons  : Conservation of motif across alignment (usealn=T).
    - Hom   : Number of homologues used in conservation score.
    - GlobID: Mean Global %ID between Query and Homologues
    - LocID : Mean Local %ID between Query and Homologues across motif
    - Pos   : The start position of the MATCH in the HIT. (1-n)
    - SA    : The mean Surface Accessibility score (Janin et al) for the matched region
    - Hyd   : The mean Eisenberg hydrophobicity score for the matched region
    *In addition, when MS-MS mode is used, there are the following additional stats:
    - SNT      : The minimum number of single nucleotide substitutions needed to change a stretch of DNA encoding the
                 MATCH into one encoding the VARIANT. (Always 0 unless mismatches allowed! This is most useful when
                 searching a database that may contain sequencing errors or, for MSMS peptides, proteins from a different
                 species.)
    - ORFMwt   : The predicted molecular weight of HIT.
    - M-ORFMWt : The predicted molecular weight of HIT starting with the first Methionine.
    - FragMWt  : The predicted molecular weight of the tryptic fragment of HIT containing MATCH.
    - N[KR]    : The number of amino acids N-terminal of MATCH before reaching a K or R. (A value of 1 indicates that
                 MATCH includes the start of a tryptic fragment.)
    - C[KR]    : The number of amino acids C-terminal of MATCH before reaching a K or R. (A value of 0 indicates that
                 MATCH includes the end of a tryptic fragment.)
    - Rating   : A crude score to help rank hits. The higher the better. [Details to be added later.]

    Dict:
    - Data = String values of output data (can also be in Stat or Info)
    
    Obj:RJE_Objects
    - Seq = rje_sequence.Sequence object
    - Motif = rje_motif.Motif object
MotifOcc._addSeq(self,seqname,fasdb)
    Fishes seqname from fasdb (using fastacmd) and sets as self.obj['Seq'].
MotifOcc._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
MotifOcc._setAttributes(self)
    Sets Attributes of Object. Unlike other RJE classes, most of these are optional.
MotifOcc.hit(self)


MotifOcc.memSaver(self)
    Cuts down attributes to bare minimum. Stores sequence name in self.info['FastaCmd']
MotifOcc.occData(self)
    Expands data dictionary for output.

rje_motifocc Module Methods

rje_motifocc.runMain()




rje_pam [version 1.2] Contains Objects for PAM matrices ~ [Top]

Module: rje_pam
Description: Contains Objects for PAM matrices
Version: 1.2
Last Edit: 16/11/10
Imports: rje
Imported By: gasp, presto_V5, peptide_dismatrix, peptide_stats, rje_ancseq, rje_seq
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
This module handles functions associated with PAM matrices. A PAM1 matrix is read from the given input file and
multiplied by itself to give PAM matrices corresponding to greater evolutionary distance. (PAM1 equates to one amino acid
substitution per 100aa of sequence.)
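
Building higher PAM matrices by repeated multiplication can be sketched with a toy example. The two-letter alphabet and its probabilities are invented for illustration (a real PAM1 matrix covers the amino acid alphabet), and the function names are hypothetical.

```python
def mat_mult(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_series(pam1, pammax):
    """Return {distance: matrix} for PAM1 to PAM(pammax) by repeated multiplication."""
    series = {1: pam1}
    for distance in range(2, pammax + 1):
        series[distance] = mat_mult(series[distance - 1], pam1)
    return series


# Toy two-letter alphabet: each row holds the probabilities of an ancestral
# letter being observed as each descendant letter after one PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.02, 0.98]]
pams = pam_series(toy_pam1, 3)
```

Each row of every matrix in the series still sums to 1, since the product of two matrices of row-wise probabilities is again a matrix of row-wise probabilities.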

Commandline:
* pamfile=X : Sets PAM1 input file [jones.pam]
* pammax=X : Initial maximum PAM matrix to generate [100]
* pamcut=X : Absolute maximum PAM matrix [1000]

Alternative PAM matrix commands:
* altpam=FILE : Alternative to PAM file input = matrix needing scaling by aafreq [None]
* seqin=FILE : Sequence file from which to calculate AA freq for scaling [None]
* pamout=FILE : Name for rescaled PAM matrix output [*.pam named after altpam=FILE]

Classes:
PamCtrl(rje.RJE_Object):
- Controls a set of PAM matrices.
PAM(pam,rawpamp,alpha):
- Individual PAM matrix.
>> pam:int = PAM distance
>> rawpamp:dictionary of substitution probabilities
>> alpha:list of amino acids (alphabet)

Uses general modules: os, string, sys, time
Uses RJE modules: rje

PAM Class



PAM.__init__(self,pam,rawpamp,alpha)
    Individual PAM matrix.
    >> pam:int = PAM distance
    >> rawpamp:dictionary of substitution probabilities
    >> alpha:list of amino acids (alphabet)
PAM.getPam(self)


PAM.pamAdjust(self)


PAM.pamCheck(self)


PAM.pamL(self,ancseq,descseq)


PAM.setPam(self,x)


PAM.setPamp(self,x)




PamCtrl Class

    PamCtrl Class. Author: Rich Edwards (2005).
    Controls a set of PAM matrices.

    Info:str
    - Name = Name (filename) [jones.pam]
    - AltPam = Alternative to PAM file input = matrix needing scaling by aafreq [None]
    - SeqIn = Sequence file from which to calculate AA freq for scaling (*Must be FASTA*) [None]
    - PamOut = Name for rescaled PAM matrix output [*.pam named after altpam=FILE]
    
    Opt:boolean
    - X-Value = Whether X is included in matrix probabilities [True]
    - GapValue = Whether - is included in matrix probabilities [True]

    Stat:numeric
    - PamCut = Absolute Upper Limit for PAM [1000]
    - PamMax = Initial Limit for PAM [100]

    Obj:RJE_Objects
    - None

    Other:
    - matrix:list       # List of matrices (PAM Objects)
    - alphabet:list     # List of letters for use in PAM matrices (String)
PamCtrl._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
PamCtrl._getPamL(self,pam=0,ancseq='',descseq='')
    Returns probability for given PAM and given sequences. Used in PamML.
    >> pam:int = PAM matrix
    >> ancseq:str = ancestral sequence
    >> descseq:str = descendant sequence
    << p:float = probability
PamCtrl._pamMLSlow(self,desc,ancseq,descseq)
    Creeps up through PAM to find best.
    >> desc:str = description
    >> ancseq:str = ancestral sequence
    >> descseq:str = descendant sequence
    << sp:int = PAM distance
PamCtrl._setAttributes(self)
    Sets Attributes of Object:
    - Info:str ['Name','Type','AltPam','SeqIn','PamOut']
    - Stats:float ['PamCut','PamMax']
    - Opt:boolean ['X-Value','GapValue']
    - Obj:RJE_Object []
    - Other: matrix, alphabet
PamCtrl.altPAM(self)
    Alternative PAM matrix construction.
PamCtrl.buildPam(self)
    Builds PAM matrix in memory.
PamCtrl.pamML(self,desc='pamML',ancseq='',descseq='')
    Calculates PAM distance between two sequences.
    >> desc:str = description
    >> ancseq:str = ancestral sequence
    >> descseq:str = descendant sequence
    << pamml:int = PAM distance
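
One way to picture a maximum likelihood PAM distance is the sketch below. It scans every distance from 1 to pammax and keeps the one that maximises the probability of the descendant sequence given the ancestral one. The exhaustive scan, the toy two-letter matrix and the name pam_ml are assumptions for illustration. The search strategy used by pamML itself is not documented here.

```python
import math


def mat_mult(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_ml(pam1, alphabet, ancseq, descseq, pammax=100):
    """Return the PAM distance (1 to pammax) with the highest likelihood.

    The log-likelihood of a distance is the sum, over aligned positions, of the
    log probability of the ancestral letter becoming the descendant letter.
    """
    index = {letter: i for i, letter in enumerate(alphabet)}
    best_pam, best_loglik = 0, float('-inf')
    matrix = pam1
    for distance in range(1, pammax + 1):
        if distance > 1:
            matrix = mat_mult(matrix, pam1)
        loglik = sum(math.log(matrix[index[a]][index[d]])
                     for a, d in zip(ancseq, descseq))
        if loglik > best_loglik:
            best_pam, best_loglik = distance, loglik
    return best_pam


toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
identical = pam_ml(toy_pam1, 'AB', 'AAAAAAAAAA', 'AAAAAAAAAA')
diverged = pam_ml(toy_pam1, 'AB', 'AAAAAAAAAA', 'AAAAAAABBB')
```

Identical sequences give the smallest distance scanned, while the pair that differs at three of ten positions is assigned a much larger distance.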
PamCtrl.pamUp(self)
    Makes PAM matrix up to self.stat['PamMax'].

rje_pam Module Methods

rje_pam.runMain()




rje_scoring [version 0.0] Scoring and Ranking Methods for RJE Python Modules ~ [Top]

Module: rje_scoring
Description: Scoring and Ranking Methods for RJE Python Modules
Version: 0.0
Last Edit: 22/01/07
Imports: rje
Imported By: aphid, presto_V5, qslimfinder, slimfinder, slimsearch, slim_pickings, rje_db, rje_motiflist, rje_slim, rje_slimcalc, rje_slimcore
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module contains methods only for ranking, filtering and generating new scores from python dictionaries. At its
conception, this is for unifying and clarifying the new scoring and filtering options used by PRESTO & SLiMPickings,
though it is conceived that the methods will also be suitable for use in other/future programs.

The general format of expected data is a list of column headers, on which data may be filtered/ranked etc. or
combined to make new scores, and a dictionary containing the data for a given entry. The keys for the dictionary
should match the headers in a *case-insensitive* fashion. (The keys and headers will not be changed but will match
without using case, so do not have two case-sensitive variables, such as "A" and "a" unless they have the same
values.) !NB! For some methods, the case should have been matched.

Methods in this module will either return the input dictionary or list with additional elements (if calculating new
scores) or take a list of data dictionaries and return a ranked or filtered list.

Methods in this module:
* setupStatFilter(callobj,statlist,filterlist) = Makes StatFilter dictionary from statlist and filterlist
* statFilter(callobj,data,statfilter) = Filters data dictionary according to statfilter dictionary.
* setupCustomScores(callobj,statlist,scorelist,scoredict) = Checks and returns Custom Scores and related lists

Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent modules.

Uses general modules: copy, os, string, sys
Uses RJE modules: rje
Other modules needed: None

rje_scoring Module Methods

rje_scoring.adjustedProb(scorelist,reverse=False)
    Returns the adjusted probability value for each score.
    >> scorelist:list of scores (low is bad)
    >> reverse:bool [False] = reverse so that low is good!
rje_scoring.rankObj(callobj,objlist,dkey,dlist=['stat'],case=False,default=None,rev=True,absolute=True,lowest=True,addstat='Rank',cutoff=0,convert=True)
    Ranks objects using data in object and rje.getData()
    >> objlist:list of objects to be ranked, ordered and returned
    >> dkey:str = Key for dictionaries
    >> dlist:list [self.stat] = list of dictionaries to try after dict['Data']
    >> case:bool [False] = whether to match case for dkey
    >> default [None] = what value to give entry if no dictionary has key (if None, will not be returned in ranked list)
    >> rev:bool [True] = whether high values should be ranked number 1
    >> absolute:boolean [True] = return 1 to n, rather than 0 to 1
    >> lowest:boolean [True] = returns lowest rank rather than mean rank in case of ties
    >> addstat:str ['Rank'] = key for callobj.stat to add rank to (will not add if None)
    >> cutoff:int [0] = only returns top X ranked motifs (if given)  (Can be float if absolute=False)
    >> convert:bool [True] = convert returned data into numeric
    << returns list of ranked objects
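
The rev and lowest options of rankObj concern ranking direction and the handling of ties. A minimal sketch on a plain list of values is shown here. The function rank_values is hypothetical and ignores the object and dictionary lookup performed by the real method.

```python
def rank_values(values, rev=True, lowest=True):
    """Return a rank (1 to n) for each value, in input order.

    rev: high values are ranked number 1 if True.
    lowest: tied values share their lowest rank if True, else their mean rank.
    """
    ordered = sorted(values, reverse=rev)
    ranks = {}
    for value in set(values):
        first = ordered.index(value) + 1           # best rank held by this value
        last = first + ordered.count(value) - 1    # worst rank held by this value
        ranks[value] = first if lowest else (first + last) / 2.0
    return [ranks[value] for value in values]


scores = [0.9, 0.5, 0.9, 0.1]
low_ranks = rank_values(scores)
mean_ranks = rank_values(scores, lowest=False)
ascending = rank_values(scores, rev=False)
```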
rje_scoring.setupCustomScores(callobj,statlist=[],scorelist=[],scoredict={})
    Sets up Custom Scores using existing statlist.
    >> callobj:RJE_Object [None] = calling object for Error Messages etc.
    >> statlist:list of stats that are allowed for custom score. Generally column headers for output.
    >> scorelist:list of Custom Score Names in order they were read in (may use prev. scores)   
    >> scoredict:dictionary of Custom Scores: {Name:Formula}
    << (statlist,scorelist,scoredict):(list,list,dictionary) of acceptable Custom Scores ([Stats],[Names],{Name:Formula})
rje_scoring.setupStatFilter(callobj,statlist=[],filterlist=[])
    Makes StatFilter dictionary from statlist and filterlist (from cmd_list) !!! Changes case of statfilter keys. !!!
    >> callobj:RJE_Object [None] = calling object for Error Messages etc.
    >> statlist:list of stats that are allowed for filtering. Generally column headers for output.
    >> filterlist:list of StatFilters read in from commandline consisting of StatOperatorValue 
    << statfilter:dictionary of StatFilter {Stat:(Operator,String,Numeric)}
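
Parsing StatOperatorValue strings into the {Stat:(Operator,String,Numeric)} form can be sketched as below. The set of accepted operators, the handling of unknown stats and the name parse_stat_filters are assumptions. Only the input and output shapes come from the description above.

```python
import re

# Two-character operators are listed first so that '>=' is not read as '>'.
FILTER_RE = re.compile(r'^(.+?)(>=|<=|!=|==|>|<|=)(.+)$')


def parse_stat_filters(statlist, filterlist):
    """Parse 'StatOperatorValue' strings into {Stat: (Operator, String, Numeric)}.

    Stat names are matched to statlist without regard to case and stored using
    the statlist spelling. Filters naming an unknown stat are skipped.
    """
    known = {stat.lower(): stat for stat in statlist}
    statfilter = {}
    for text in filterlist:
        match = FILTER_RE.match(text.strip())
        if not match:
            continue
        stat, operator, value = match.groups()
        if stat.lower() not in known:
            continue
        try:
            numeric = float(value)
        except ValueError:
            numeric = None      # value can only be compared as a string
        statfilter[known[stat.lower()]] = (operator, value, numeric)
    return statfilter


filters = parse_stat_filters(['Cons', 'Hom', 'Match'],
                             ['cons<0.5', 'Hom>=3', 'Match!=RGD', 'Bogus>1'])
```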
rje_scoring.statFilter(callobj,data={},statfilter={},inverse=False,filtermissing=False)
    Filters data according to statfilter and returns filtered data. Filtering is *exclusive* based on statfilter.
    >> callobj:RJE_Object [None] = calling object for Error Messages etc.
    >> data:dictionary of data on which to filter {AnyKey:{stat:value}}
    >> statfilter:dictionary of StatFilter {Stat:(Operator,String,Numeric)}
    >> inverse:bool = Whether to reverse filter [False]
    >> filtermissing:bool [False] = whether to treat missing data as a "fail" and filter it [False]
    << data:dictionary of filtered data. This is *the same* dictionary and must be pre-reassigned if original needed.
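
The exclusive, in-place behaviour of statFilter can be sketched as follows: entries that match a filter are removed from the dictionary that was passed in. The function stat_filter, the operator table and the reading of inverse as "keep only the matches" are assumptions made for illustration.

```python
import operator

OPERATORS = {'>': operator.gt, '>=': operator.ge, '<': operator.lt,
             '<=': operator.le, '==': operator.eq, '=': operator.eq,
             '!=': operator.ne}


def stat_filter(data, statfilter, inverse=False, filtermissing=False):
    """Remove entries of data that match any filter and return the same dictionary.

    data: {key: {stat: value}}
    statfilter: {stat: (operator, string_value, numeric_value_or_None)}
    inverse: remove the entries that do not match instead.
    filtermissing: treat a missing stat as matching the filter.
    """
    for key in list(data):
        entry = data[key]
        matched = False
        for stat, (op, text, numeric) in statfilter.items():
            if stat not in entry:
                matched = matched or filtermissing
            elif numeric is not None:
                matched = matched or OPERATORS[op](float(entry[stat]), numeric)
            else:
                matched = matched or OPERATORS[op](str(entry[stat]), text)
        if matched != inverse:
            del data[key]
    return data


hits = {'a': {'Cons': '0.9'}, 'b': {'Cons': '0.2'}, 'c': {}}
original = dict(hits)     # copy first if the unfiltered dictionary is still needed
kept = stat_filter(hits, {'Cons': ('<', '0.5', 0.5)})
```

The filter Cons<0.5 removes entry 'b'. Entry 'c' has no Cons value and survives because filtermissing is False.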
rje_scoring.statFilterObj(callobj,objlist,statfilter={})
    Filters data according to statfilter and returns filtered data. Filtering is *exclusive* based on statfilter.
    >> callobj:RJE_Object [None] = calling object for Error Messages etc.
    >> objlist:list of objects to be filtered
    >> statfilter:dictionary of StatFilter {Stat:(Operator,String,Numeric)}
    << objlist:list of filtered objects. This is *the same* list and must be pre-reassigned if original needed.

RJE_SEQ [version 3.12] DNA/Protein sequence list module ~ [Top]

Program: RJE_SEQ
Description: DNA/Protein sequence list module
Version: 3.12
Last Edit: 11/07/11
Imports: rje, rje_blast, rje_dismatrix_V2, rje_pam, rje_sequence, rje_uniprot, rje_zen, rje_slimcalc
Imported By: aphid, badasp, budapest, comparimotif_V3, fiesta, gablam, gasp, gfessa, gopher_V2, happi, haqesac, multihaq, picsi, pingu, presto_V5, proteinAttributes, qslimfinder, seqmapper, slimfinder, slimsearch, unifake, compass, peptide_dismatrix, peptide_stats, prodigis, rje_dbase, rje_pattern_discovery, rje_phos, rje_seqgen, rje_seqplot, rje_yeast, seqforker, sfmap2png, slim_pickings, RankByDistribution, ned_SLiMPrints, ned_SLiMPrints_Tester, ned_conservationScorer, ned_rankbydistribution, rje_biogrid, rje_codons, rje_ensembl, rje_genemap, rje_hprd, rje_iridis, rje_markov, rje_motif_cons, rje_motif_stats, rje_motiflist, rje_motifocc, rje_paml, rje_seqlist, rje_slimcalc, rje_slimcore, rje_slimhtml, rje_slimlist, rje_tm, rje_tree, rje_tree_group, slimjim, slimprints
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
Contains Classes and methods for sets of DNA and protein sequences.

Sequence Input/Output Commands:
* seqin=FILE : Loads sequences from FILE (fasta,phylip,aln,uniprot or fastacmd names from fasdb) [None]
* query=X : Selects query sequence by name [None]
* acclist=LIST : Extract only AccNums in list. LIST can be FILE or list of AccNums X,Y,.. [None]
* fasdb=FILE : Fasta format database to extract sequences from [None]
* mapseq=FILE : Maps sequences from FILE to sequences of same name [None]
* mapdna=FILE : Map DNA sequences from FILE onto sequences of same name in protein alignment [None]
* seqout=FILE : Saves 'tidied' sequences to FILE after loading and manipulations [None]
* reformat=X : Outputs sequence in a particular format, where X is:
- fasta/fas/phylip/scanseq/acclist/acc/idlist/fastacmd/teiresias/mysql/nexus/3rf/6rf/est6rf [None]
- if no seqout=FILE given, will use input file name as base and add appropriate extension.
#!# reformat=X may not be fully implemented. Report erroneous behaviour! #!#
* logrem=T/F : Whether to log removed sequences [True] - suggest False with filtering of large files!

Sequence Loading/Formatting Options:
* alphabet=LIST : Alphabet allowed in sequences [standard 1 letter AA codes]
* replacechar=T/F : Whether to remove numbers and replace characters not found in the given alphabet with 'X' [True]
* autofilter=T/F : Whether to automatically apply sequence filters etc. upon loading sequence [True]
* autoload=T/F : Whether to automatically load sequences upon initiating object [True]
* memsaver=T/F : Minimise memory usage. Input sequences must be fasta. [False]
* degap=T/F : Degaps each sequence [False]
* tidygap=T/F : Removes any columns from alignments that are 100% gap [True]
* ntrim=X : Trims off regions >= X proportion N bases (X residues for protein) [0.0]
* gnspacc=T/F : Convert sequences into gene_SPECIES__AccNum format wherever possible. [False]
* seqtype=X : Force program to read as DNA, RNA, Protein or Mixed (case insensitive; read=Will work it out) [None]
* dna=T/F : Alternative identification of sequences as DNA [False]
* mixed=T/F : Whether to allow auto-identification of mixed sequences types (else uses first seq only) [False]
* align=T/F : Whether the sequences should be aligned. Will align if unaligned. [False]
* rna2dna=T/F : Converts RNA to DNA [False]
* trunc=X : Truncates each sequence to the first X aa. (Last X aa if -ve) (Useful for webservers like SignalP.) [0]
* usecase=T/F : Whether to output sequences in mixed case rather than converting all to upper case [False]
* case=LIST : List of positions to switch case, starting with first lower case (e.g case=20,-20 will have ends UC) []
* countspec=T/F : Generate counts of different species and output in log [False]
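
Three of the formatting options above (trunc, degap and tidygap) are simple string operations and can be sketched on plain Python strings. The function names are hypothetical and the sketch does not use the RJE_SEQ classes.

```python
def truncate(sequence, trunc=0):
    """Keep the first trunc residues, or the last abs(trunc) if trunc is negative."""
    if trunc > 0:
        return sequence[:trunc]
    if trunc < 0:
        return sequence[trunc:]
    return sequence


def degap(sequence, gap='-'):
    """Remove all gap characters from a single sequence."""
    return sequence.replace(gap, '')


def tidy_gaps(alignment, gap='-'):
    """Remove columns that are 100% gap from equal-length aligned sequences."""
    if not alignment:
        return []
    keep = [i for i in range(len(alignment[0]))
            if any(seq[i] != gap for seq in alignment)]
    return [''.join(seq[i] for i in keep) for seq in alignment]


tidied = tidy_gaps(['MK-T-A', 'M--T-A', 'MR-T--'])
```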

Sequence Filtering Commands:
* filterout=FILE : Saves filtered sequences (as fasta) into FILE. *NOTE: File is appended if append=T* [None]
* minlen=X : Minimum length of sequences [0]
* maxlen=X : Maximum length of sequences (<=0 = No maximum) [0]
* maxgap=X : Maximum proportion of sequence that may be gaps (<=0 = No maximum) [0]
* maxx=X : Maximum proportion of sequence that may be Xs (<=0 = No maximum; >=1 = Absolute no.) [0]
* maxglob=X : Maximum proportion of sequence predicted to be ordered (<=0 = None; >=1 = Absolute) [0]
* minorf=X : Minimum ORF length for a DNA/EST translation (reformatting only) [0]
* minpoly=X : Minimum length of poly-A tail for 3rf / 6rf EST translation (reformatting only) [20]
* gapfilter=T/F : Whether to filter gappy sequences upon loading [True]
* nosplice=T/F : If nosplice=T, UniProt splice variants will be filtered out [False]
* dblist=LIST : List of databases in order of preference (good to bad)
[sprot,ipi,uniprot,trembl,ens_known,ens_novel,ens_scan]
* dbonly=T/F : Whether to only allow sequences from listed databases [False]
* unkspec=T/F : Whether sequences of unknown species are allowed [True]
* accnr=T/F : Check for redundant Accession Numbers/Names on loading sequences. [True]
* seqnr=T/F : Make sequences Non-Redundant [False]
* nrid=X : %Identity cut-off for Non-Redundancy (GABLAMO) [100.0]
* nrsim=X : %Similarity cut-off for Non-Redundancy (GABLAMO) [None]
* nralign=T/F : Use ALIGN for non-redundancy calculations rather than GABLAMO [False]
* specnr=T/F : Non-Redundancy within same species only [False]
* querynr=T/F : Perform Non-Redundancy on Query species (True) or limit to non-Query species (False) [True]
* nrkeepann=T/F : Append annotation of redundant sequences onto NR sequences [False]
* goodX=LIST : Filters where only sequences meeting the requirement of LIST are kept.
LIST may be a list X,Y,..,Z or a FILE which contains a list [None]
- goodacc = list of accession numbers
- goodseq = list of sequence names
- goodspec = list of species codes
- gooddb = list of source databases
- gooddesc = list of terms, at least one of which must be in the description line
* badX=LIST : As goodX but excludes rather than retains filtered sequences
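
The goodX/badX options above can be pictured with the following sketch. It is not the rje_seq implementation: read_list and apply_good_bad are hypothetical names, and the rule that an argument naming an existing file is read one entry per line is an assumption about the "LIST or FILE" behaviour.

```python
import os


def read_list(arg):
    """Return a list from 'X,Y,Z' or, if arg is an existing file, from its lines."""
    if os.path.isfile(arg):
        with open(arg) as handle:
            return [line.strip() for line in handle if line.strip()]
    return [item for item in arg.split(',') if item]


def apply_good_bad(values, good=None, bad=None):
    """Keep values that are in good (if given) and not in bad (if given)."""
    kept = []
    for value in values:
        if good and value not in good:
            continue    # goodX: only listed values are retained
        if bad and value in bad:
            continue    # badX: listed values are excluded
        kept.append(value)
    return kept
```

For example, apply_good_bad(species_codes, good=read_list("HUMAN,MOUSE")) would retain only those two species codes.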

System Info Commands: * Use forward slashes for paths (/)
* blastpath=PATH : Path to BLAST programs ['']
* fastapath=PATH : Path to FASTA programs ['']
* clustalw=PATH : Path to CLUSTALW program ['clustalw']
* mafft=PATH : Path to MAFFT alignment program ['mafft']
* muscle=PATH : Path to MUSCLE ['muscle']
* fas=PATH : Path to FSA alignment program ['fsa']
* win32=T/F : Run in Win32 Mode [False]
* alnprog=X : Choice of alignment program to use (clustalw/muscle/mafft/fsa) [muscle]

Sequence Manipulation/Function Commands:
* pamdis : Makes an all-by-all PAM distance matrix
* split=X : Splits file into numbered files of X sequences. (Useful for webservers like TMHMM.)
* relcons=FILE : Returns a file containing Pos AbsCons RelCons [None]
* relconwin=X : Window size for relative conservation scoring [30]
* makepng=T/F : Whether to make RelCons PNG files [False]
* seqname=X : Output sequence names for PNG files etc. (short/Name/Number/AccNum/ID) [short]

DisMatrix Options:
* outmatrix=X : Type for output matrix - text / mysql / phylip

Special:
* blast2fas=FILE1,FILE2,...,FILEn : Will blast sequences against list of databases and compile a fasta file of results per query
- use options from rje_blast.py for each individual blast (blastd=FILE will be over-ridden)
- saves results in AccNum.blast.fas and will append existing files!
* haqbat=FILE : Generate a batch file (FILE) to run HAQESAC on generated BLAST files, with seqin as queries [None]

Classes:
SeqList(rje.RJE_Object):
- Sequence List Class. Holds a list of Sequence Objects and has methods for manipulation etc.
Sequence(rje_sequence.Sequence):
- Individual Sequence Class.
DisMatrix(rje_dismatrix.DisMatrix):
- Sequence Distance Matrix Class.

Uses general modules: copy, math, os, random, re, shutil, sre_constants, string, sys, time
Uses RJE modules: rje, rje_blast, rje_dismatrix, rje_pam, rje_sequence, rje_uniprot

rje_seq Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Renamed major attributes
    # 0.2 - New implementation on more generic OO approach. Non-Redundancy Output
    # 0.3 - No Out Object in Objects
    # 1.0 - Better Documentation to go with GASP V:1.2
    # 1.1 - Better DNA stuff
    # 1.2 - Added ClustalW align
    # 1.3 - Separated Sequence object into rje_sequence.py
    # 1.4 - Add rudimentary gnspacc=T/F
    # 1.5 - Changed pwAln to use popen()
    # 1.6 - Fixed nrdic problem in Redundancy check and added user-definition of database list
    # 1.8 - Added UniProt input and acclist reading
    # 1.9 - Added 'reformat=scanseq' option but not properly implemented. Added align=T/F.
    # 2.0 - Major reworking of commandline options and introduction of self.list dictionary (rje v3.0)
    # 2.1 - Added reformat of UniProt with memsaver=T.
    # 2.2 - Added GABLAM non-redundancy
    # 2.3 - Added NR in memsaver mode
    # 2.4 - Changed some of the log output (REM and redundancy) to look better.
    # 2.5 - Added nr_qry to makeNR()
    # 2.6 - Added mysql reformat output: fastacmd, protein_id, acc_num, spec_code, description (delimited)
    # 2.7 - Added SeqCount() method. Incorporated reading of sequence case.
    # 2.8 - Added NEXUS output for MrBayes compatibility
    # 2.9 - Added setupSubDict(masking=True) for use in probabilistic conservation scores
    # 3.0 - Start of improvements for DNA sequences with dna=T.
    # 3.1 - Added relative conservation calculations for a whole alignment.
    # 3.2 - Added output of sequences for making alignments in R.
    # 3.3 - Added 6RF reformatting for DNA sequences.
    # 3.4 - Added HAQBAT option
    # 3.5 - Added extra alignment program, MAFFT
    # 3.6 - Added stripGap() method. Replaced self.seq with self.seqs() for reading. (Replace with list at some point.)
    # 3.7 - Added raw option for single sequence load.
    # 3.8 - Added maxGlob setting for screening out globular proteins.
    # 3.9 - Added reading of mafft format when not producing standard fasta.
    # 3.10- Added ntrim=X : Trims off regions >= X proportion N bases (X residues for protein) [0.5]
    # 3.11- Added mapdna=FILE option to map DNA sequences onto protein alignment
    # 3.12- Added countspec=T/F   : Generate counts of different species and output in log [False]

DisMatrix Class

    Sequence Distance Matrix Class. Author: Rich Edwards (2005).
    See rje_dismatrix.py for details.

SeqList Class

    Sequence List Class. Author: Rich Edwards (2005).

    Info:str
    - AccList = Extract only AccNums in list. X can be FILE or list of AccNums X,Y,.. [None]
    - AlnProg = Choice of alignment program to use (clustalw/muscle/mafft/fsa) [muscle]
    - Basefile = Base for filenames (e.g. extension removed)    !New addition so may have quirky functionality!
    - BLAST Path = path to blast programs
    - ClustalW = Path to ClustalW
    - DBList = List of databases in order of preference X,Y,Z (lower case)
    - Description = Description of sequence list (if desired)
    - FasDB = Fasta format database to extract sequences from [None]
    - FilterOut = Saves filtered sequences (as fasta) into FILE. *NOTE: This file is appended* [None]
    - FASTA Path = path to fasta programs
    - FSA = Path to FSA alignment program ['fsa']            
    - HAQBAT = Generate a batch file (FILE) to run HAQESAC on generated BLAST files, with seqin as queries [None]
    - MapDNA = Map DNA sequences from FILE onto sequences of same name in protein alignment [None]
    - MapSeq = Maps sequences from FILE to sequences of same name [None]
    - MAFFT = path to MAFFT
    - MUSCLE = path to MUSCLE
    - Name = Name of sequence group (Generally filename)
    - Query = Query sequence name for selection [None]
    - ReFormat = Outputs sequence in a particular format, where X is fasta/phylip/scanseq/acclist/fastacmd/teiresias [None]
    - RelCons = Returns a file containing Pos AbsCons RelCons [None]
    - SeqName = Output sequence names for PNG files etc. (short/Name/Number/AccNum/ID) [short]
    - SeqOut = Saves 'tidied' sequence file after loading and manipulations
    - Type = Type of sequences
    
    Opt:boolean
    - AccNR = Whether to check for redundant Accession Numbers/Names on loading sequences. [False]
    - Align = Whether the sequences should be aligned. Will align if unaligned. [False]
    - Aligned = Whether sequences are aligned (i.e. same length)
    - AutoFilter = Whether to automatically apply sequence filters etc. upon loading sequence [True]
    - AutoLoad = Whether to automatically load sequences upon initiating object [True]
    - CountSpec = Generate counts of different species and output in log [False]
    - DBOnly = Whether to only allow sequences from listed databases
    - Degap = Degaps each sequence. [False]
    - DNA = Alternative identification of sequences as DNA [False]
    - GapFilter = Whether to filter gappy sequences upon loading [True]
    - Gapped = Whether sequences have gaps
    - GeneSpAcc = Converts sequence names into gene_SPEC/Acc format [False]
    - LogRem = Whether to log removed sequences [True] - suggest use of this with filtering of large files!
    - MakePNG = Whether to make RelCons PNG files [False]
    - Mixed = Whether to allow auto-identification of mixed sequences types (else uses first seq only) [False]
    - NoSplice = If nosplice=T, UniProt splice variants will be filtered out [False]
    - NRKeepAnn = Append annotation of redundant sequences onto NR sequences [False]
    - QueryNR = Perform Non-Redundancy on Query species (True) or limit to non-Query species (False) [True]
    - RNAtoDNA = Converts RNA to DNA [False]
    - SeqNR = Make sequences Non-Redundant [False]
    - SpecNR = Non-Redundancy within same species only
    - TidyGap = Removes any columns from alignments that are 100% gap [True]
    - UnkSpec = Whether sequences of unknown species are allowed [True]    
    - NR Align = Use ALIGN for non-redundancy calculations rather than GABLAMO [False]
    - UseCase = Whether to output sequences in mixed case rather than converting all to upper case [False]
    - ReplaceChar = Whether to replace characters not found in the given alphabet with 'X' [True]
    - StripNum = Whether to remove numbers from the input sequence [True]
    
    Stat:numeric 
    - MaxGap = Maximum proportion of gaps in sequences (<=0 = No maximum) [0]
    - MaxGlob = Maximum proportion of sequence predicted to be ordered (<=0 = None; >=1 = Absolute) [0]
    - MaxLen = Maximum length of sequences (<=0 = No maximum) [0]
    - MinLen = Minimum length of sequences [0]
    - MinORF = Minimum ORF length for a DNA/EST translation (reformatting only) [0]
    - MinPoly = Minimum length of poly-A tail for 3rf / 6rf EST translation (reformatting only) [20]
    - MaxX = Maximum proportion of sequence that may be Xs (<=0 = No maximum) [0]
    - NR ID = %Identity cut-off for Non-Redundancy (GABLAMO) [100.0]
    - NR Sim = %Similarity cut-off for Non-Redundancy (GABLAMO) [0.0 (not used)]
    - NTrim = Trims off regions >= X proportion N bases [0.0]
    - RelConWin = Determines window size for RelCons calculation [30]
    - Split = Splits file into numbered files of X sequences. (Useful for webservers like TMHMM.)
    - Trunc = Truncates each sequence to the first X aa. (Last X aa if -ve) (Useful for webservers like SignalP.) [0]

    List:list
    - Alphabet = Alphabet allowed in sequences [standard 1 letter AA codes]
    #!# GoodX and BadX only retained in memory if MemSaver=False. Otherwise, will load, filter, and clear afterwards #!#
    - GoodAcc = list of good accession numbers
    - GoodSeq = list of good sequence names
    - GoodSpec = list of good species codes
    - GoodDB   = list of good source databases
    - GoodDesc = list of terms, at least one of which must be in the description line
    - BadAcc = list of Bad accession numbers
    - BadSeq = list of Bad sequence names
    - BadSpec = list of Bad species codes
    - BadDB   = list of Bad source databases
    - BadDesc = list of terms; sequences with at least one of these in the description line are excluded
    - Case = List of positions to switch case, starting with first lower case (e.g. case=20,-20 will have ends UC) []
    - Blast2Fas = FILE1,FILE2,...,FILEn : Will blast sequences against list of databases and compile a fasta file of results per query

    Obj:RJE_Objects
    - QuerySeq = Sequence Object to be used as query
    - PWAln ID = DisMatrix of %ID from Pairwise global alignments (ALIGN)
    - PWAln Sim = DisMatrix of %Sim from Pairwise global alignments (ALIGN)
    - MSA ID = DisMatrix of %ID from Multiple Sequence Alignment
    - MSA Gaps = DisMatrix of %Gaps from Multiple Sequence Alignment
    - MSA Extra = DisMatrix of %Extra aa from Multiple Sequence Alignment
    - PAM Dis = DisMatrix of PAM Distances
    - UniProt = rje_uniprot.UniProt object if UniProt input and memsaver=F

    Other:
    seq:list of Sequence objects
SeqList.Align2Seq(self,sequence1,sequence2,unlink=True,retry=5)
    Uses align to make a pairwise alignment of seq1 and seq2.
    >> sequence1 & sequence2:string Sequences to align
    >> unlink:Boolean = whether to delete tmp files.
    >> retry:int = Decrease with each retry: if retry = 0, give up!
    << PWAln object
SeqList.MWt(self,sequence)


SeqList._addSeq(self,name,sequence)
    Adds a new Sequence Object to list.
    >> name:str = sequence name line (inc. description)
    >> sequence:str = sequence
SeqList._bestDB(self,seq1,seq2,dblist=None)
    Returns sequence from 'best' database or None if the same.
    Used for determining which sequence to remove in the case of redundancy.
    >> seq1:Sequence Object
    >> seq2:Sequence Object
    >> dblist:list of strings = list of DBase annotations, best to worst (lower case)
    << Sequence Object or None
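    The ranking idea behind _bestDB can be sketched as below. This is only an illustration: the real method takes Sequence objects, whereas the hypothetical best_db here takes database names directly, and ranking an unlisted database below every listed one is an assumption.

```python
def best_db(db1, db2, dblist):
    """Return the preferred database name, or None if both rank equally."""
    def rank(db):
        db = db.lower()
        # Assumption: databases missing from dblist rank below all listed ones.
        return dblist.index(db) if db in dblist else len(dblist)

    rank1, rank2 = rank(db1), rank(db2)
    if rank1 == rank2:
        return None
    return db1 if rank1 < rank2 else db2


# Default order of preference taken from the dblist option above.
DBLIST = ['sprot', 'ipi', 'uniprot', 'trembl', 'ens_known', 'ens_novel', 'ens_scan']
```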
SeqList._checkAln(self,aln=False,realign=False,tidygaps=True,seqkey='Sequence')
    Checks whether the sequences are aligned using:
    (1) Presence of Gaps
    (2) Equal lengths of sequences.
    >> aln:boolean [False] = whether sequences are 'meant' to be aligned
    >> realign:boolean [False] = whether to realign if aln=True and sequences aren't aligned
    >> tidygaps:boolean [True] = whether to tidy 100% gapped columns if already aligned and meant to be
    >> seqkey:str ['Sequence'] = seq.info key to use to check alignment
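    The two alignment checks and the tidying of 100% gapped columns might look like the sketch below, which works on plain strings rather than Sequence objects. The names looks_aligned and tidy_gaps are hypothetical, and how the real method combines the two checks is not stated here, so both results are simply returned.

```python
def looks_aligned(sequences, gap='-'):
    """Return (equal_length, has_gaps) for a list of sequence strings."""
    equal_length = len({len(seq) for seq in sequences}) <= 1
    has_gaps = any(gap in seq for seq in sequences)
    return equal_length, has_gaps


def tidy_gaps(sequences, gap='-'):
    """Drop alignment columns in which every sequence has a gap."""
    if not sequences:
        return []
    keep = [i for i in range(len(sequences[0]))
            if any(seq[i] != gap for seq in sequences)]
    return [''.join(seq[i] for i in keep) for seq in sequences]
```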
SeqList._checkForDup(self,remdup=True)
    Checks for and removes Duplicate Sequences. Checks Name, AccNum and Sequence.
    - renames if not removed or sequence different
    >> remdup:Boolean = whether to remove duplicate sequences
SeqList._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SeqList._countSpec(self)
    Counts sequences for each species and outputs numbers in log file.
SeqList._filterCmd(self,cmd_list=None)
    Reads filter commands into attributes.
    >> cmd_list:list of commands from which to get filter options [None = self.cmd_list]
SeqList._filterSeqs(self)
    Performs automatic sequence filtering based on self.attributes.
SeqList._readFileType(self,seqfile,filetype,re_header)
    Determines format of input file for loading sequences.
    >> seqfile:str = file name
    >> filetype:str = format of sequence file
    - 'fas' = fasta, 'phy' = phylip, 'aln' = clustalW alignment, 'uniprot' = UniProt, 'fastacmd' = Fastacommand names
SeqList._setAttributes(self)
    Sets Attributes of Object.
SeqList._worstSeq(self,seq1,seq2,best_ann=False)
    Returns the sequence that should be removed and reason, based on annotation etc.
    >> seq1:Seq Object
    >> seq2:Seq Object
    >> best_ann:boolean [False] = whether to preferentially keep shorter but better annotated sequences.
    << tuple:(Worst Sequence, reason (string), good sequence shortname)
SeqList.aaCount(self)
    Returns total count of AA in seqlist.
SeqList.aaFreq(self,alphabet=None,fromfile=None,loadfile=None,total=False)
    Returns dictionary of AA (& gap etc.) frequencies.
    >> alphabet:list [None] = list of characters of interest
    - if alphabet == None, will return all characters found in seq
    >> fromfile:str = File from which to read sequences. If None, will use self.seq. [None]
    >> loadfile:str = File of aa frequencies from which to read frequencies. If None, will generate from self.seq [None]
    >> total:boolean = Whether to return additional element {'Total':aax} [False]
    << aafreq:dic = dictionary of frequency of characters
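    A minimal sketch of the frequency calculation, written for Python 3 and operating on plain strings. The function aa_freq is hypothetical; the fromfile and loadfile arguments are left out, and only the 'Total' key follows the description above.

```python
from collections import Counter


def aa_freq(sequences, alphabet=None, total=False):
    """Return {character: frequency} over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq)
    if alphabet is not None:
        # Restrict to the characters of interest (absent ones count as zero).
        counts = Counter({aa: counts.get(aa, 0) for aa in alphabet})
    aax = sum(counts.values())
    freq = {aa: (counts[aa] / aax if aax else 0.0) for aa in counts}
    if total:
        freq['Total'] = aax    # additional element, as in the total option
    return freq
```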
SeqList.aaFreqEntropy(self)
    Returns entropy based on whole alignment AAFreqs.
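    For reference, an entropy over a frequency dictionary can be computed as in the sketch below. The function name and the choice of log base 2 are assumptions; the base used by aaFreqEntropy is not stated in this documentation.

```python
import math


def shannon_entropy(freqs, base=2):
    """Shannon entropy of a {character: frequency} dictionary."""
    # Zero frequencies contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p, base) for p in freqs.values() if p > 0)
```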
SeqList.aaTotal(self,nonx=False)
    Returns total number of AA in dataset.
SeqList.accList(self,seqs=None)


SeqList.addMatrix(self,idkey,sym=False)
    Adds distance matrix object if missing.
    >> idkey:str = Key for DisMatrix
    >> sym:boolean = Whether symmetrical
SeqList.align(self,alnprog=None,outfile=None,mapseq=True,alncommands='',log=True,clustalbackup=True)
    Uses program of choice to align self and returns aligned sequence object.
    >> alnprog:str = Alignment program to use (self.info['AlnProg'] if None)
    >> outfile:str = name for output file. [self.muscle] if None
    >> mapseq:boolean = whether to map sequences back on to (and return) self [True]
    >> alncommands:str = additional commandline commands
    >> log:bool [True] = Whether to report alignment activity to Log
    >> clustalbackup [True] = Whether to use ClustalW as backup in case of failure
    << SeqList object
SeqList.alignWithClustal(self,infile,outfile=None,mapseq=True,alncommands='',log=True)
    Uses clustalw to align self and returns aligned sequence object.
    >> infile:str = name of input file. 
    >> outfile:str = name for output file. [base.aln] if None
    >> mapseq:boolean = whether to map sequences back on to (and return) self [True]
    >> alncommands:str = additional commandline commands
    << SeqList object
SeqList.alignWithFSA(self,infile,outfile=None,mapseq=True,alncommands='',log=True)
    Uses FSA to align self and returns aligned sequence object.
    >> infile:str = name of input file. 
    >> outfile:str = name for output file. [input.muscle] if None
    >> mapseq:boolean = whether to map sequences back on to (and return) self [True]
    >> alncommands:str = additional commandline commands
    << SeqList object
SeqList.alignWithMAFFT(self,infile,outfile=None,mapseq=True,alncommands='',log=True)
    Uses MAFFT to align self and returns aligned sequence object.
    >> infile:str = name of input file. 
    >> outfile:str = name for output file. [input.muscle] if None
    >> mapseq:boolean = whether to map sequences back on to (and return) self [True]
    >> alncommands:str = additional commandline commands
    << SeqList object
SeqList.alignWithMUSCLE(self,infile,outfile=None,mapseq=True,alncommands='',log=True)
    Uses muscle to align self and returns aligned sequence object.
    >> infile:str = name of input file. 
    >> outfile:str = name for output file. [input.muscle] if None
    >> mapseq:boolean = whether to map sequences back on to (and return) self [True]
    >> alncommands:str = additional commandline commands
    << SeqList object
SeqList.autoFilter(self,cmd_list=None)
    Performs automatic sequence filtering based on cmd_list.
    >> cmd_list:list of commandline options [self.cmd_list]
SeqList.autoLoad(self,autoload=None,autofilter=None,memsaver=None)
    Controls automatic loading and filtering. Arguments will use self.opt values if None.
    >> autoload:boolean = Whether to load sequences [None]
    >> autofilter:boolean = Whether to filter sequences [None]
    >> memsaver:boolean = Whether to perform filtering using MemSaver option [None]
SeqList.bestMeanID(self,queries,seqlist,key=None)
    >> queries:list of Sequence Objects to compare
    >> seqlist:list of Sequence Objects used for comparison
    >> key:str = key to self.obj list (DisMatrix object)
    << bestseq = Sequence Object
SeqList.clustalAln(self,outfile=None,mapseq=True,alncommands='')
    Uses clustalw to align self and returns aligned sequence object.
    >> outfile:str = name for output file. [base.aln] if None
    >> mapseq:boolean = whether to map sequences back on to (and return) self [True]
    >> alncommands:str = additional commandline commands
    << SeqList object
SeqList.degapSeq(self,log=True)
    Removes gaps from sequences.
    >> log:boolean = Whether to report degapping in log.
SeqList.dna(self)


SeqList.formatOut(self,SEQOUT,seq,format,delimit=',')
    Saves seq to open file SEQOUT in format.
    >> SEQOUT:file handle = output file, open for writing
    >> seq:Sequence Object to be saved
    >> format:str = format of file (fasta/phylip/scanseq/acclist/idlist/fastacmd/teiresias)
    >> delimit:str = text delimiter for MySQL format
SeqList.fullDetails(self)
    Displays details of SeqList and all Sequences.
SeqList.gapSeqFilter(self,relative='query',keepqry=True)
    Removes gappy sequences, relative to query if given.
    >> relative:str = gap measure relative to:
        - self = own sequence alone (as part of alignment)
        - query = Query sequence
        - neighbour = closest neighbour in alignment (%ID)
    >> keepqry:bool [True] = whether to keep query no matter what
SeqList.getAlphabet(self)
    Returns appropriate alphabet for use.
SeqList.getDis(self,seq1,seq2,key=None,unlink=True)
    Returns/generates appropriate distance.
    >> seq1 & seq2:Sequence Objects
    >> key:str = key to self.obj list (DisMatrix object)
    >> unlink:Boolean = whether to delete tmp files.
    << float = Distance
SeqList.loadSeqs(self,seqfile=None,filetype=None,seqtype=None,aln=None,nodup=None,clearseq=True)
    Loads sequences from file.
    >> seqfile:str = file name
    >> filetype:str = format of sequence file
    - 'fas' = fasta, 'phy' = phylip, 'aln' = clustalW alignment
    >> seqtype:str = type of sequence in file
    >> aln:Boolean = whether sequences should be aligned
    >> nodup:Boolean = whether to check for (and remove) duplicate sequences.
    >> clearseq:Boolean = whether to clear existing sequences prior to loading [True]
SeqList.makeBaseFile(self)
    Makes self.info['Basefile'] based on self.info['Name'].
SeqList.makeNR(self,best_ann=True,text='',nrid=None,samespec=None,blast=100,pw_aln=True,save_red=False,check100=True,skip_to_seq=None,nrsim=None,nr_qry=None)
    Makes sequence list non-redundant. Works on %ID, calculating with align if necessary.
    If best_ann is selected, will choose the sequence with the best annotation:
    - sprot > trEMBL > ens_known > ens_novel > ens_scan
    Otherwise (or if even), will choose the longer sequence else the first sequence.
    >> best_ann:boolean [True] = whether to preferentially keep shorter but better annotated sequences.
    >> text:str = Extra text for removeSeq()
    >> nrid:float [100.0] = percentage identity cut-off for redundancy decision (self.stat['NR ID'])
    >> nrsim:float [100.0] = percentage similarity cut-off for redundancy decision (self.stat['NR Sim'])
    >> check100:boolean [True] = Whether to check for 100% matches by simple sequence comparison
    >> samespec:boolean [None] = Whether to apply cut-off to same species only. [Not check100] (self.opt['SpecNR'])
    >> blast:integer [100] = number of hits in BLAST screen used to identify most similar proteins before applying ID cut-off
        - 0 = no BLAST screen
    >> pw_aln:boolean [True] = will use ALIGN to calculate %ID (<100). If False will use current sequence alignment.
    >> save_red:boolean [False] = if True, will return Sequence Object containing redundant sequences
    >> skip_to_seq:Sequence Object [None] = if given, will not make any tests until Sequence reached.
    >> nr_qry:Sequence object [None] = if given, will only compare sequences to this one sequence (not each other)
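    The sketch below illustrates only the check100 step (100% matches found by simple sequence comparison), using (name, sequence) tuples instead of Sequence objects. It is not the makeNR implementation: the function name is hypothetical, annotation ranking (best_ann), BLAST screening and pairwise alignment are ignored, and keeping the first of two identical sequences is an assumption based on the "else the first sequence" rule above.

```python
def make_nr_100(records):
    """Split (name, sequence) records into kept and removed exact duplicates."""
    seen = {}                 # upper-case sequence -> name of the kept record
    kept, removed = [], []
    for name, seq in records:
        key = seq.upper()
        if key in seen:
            # Record which kept sequence made this one redundant.
            removed.append((name, seen[key]))
        else:
            seen[key] = name
            kept.append((name, seq))
    return kept, removed
```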
SeqList.mapSeq(self,seqlist,mapaln=False)
    Maps sequences (info['Sequence']) from seqlist onto self Sequence Objects.
    - looks for matching shortName() or matching AccNum
    >> seqlist:SeqList Object that has sequences to map
    >> mapaln:bool [False] = Map the new sequences onto an existing alignment.
    << returns True if all sequences mapped, else False
SeqList.mapX(self,query=None,seqlist=[],skipgaps=True,qtrim=False,focus=[0,0])
    Maps Xs from query sequence onto other sequences.
    >> query:Sequence Object from which to take Xs (Returns if None) [None]
    >> seqlist:list of Sequence objects to map Xs onto (uses self.seq if []) []
    >> skipgaps:boolean = Whether to skip gaps (True) or replace gaps with Xs (False) [True]
    >> qtrim:boolean= Whether to replace residues outside of Query with Xs [False]
    >> focus:list of range positions [X:Y] to look at. If Y=0 then [X:]. Outside will be Xd.
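    On a pair of aligned strings, the X mapping with the skipgaps option could be sketched as follows. The function map_x is hypothetical and covers neither qtrim nor focus.

```python
def map_x(query, target, skipgaps=True, gap='-'):
    """Copy X positions from an aligned query onto an aligned target string."""
    out = []
    for q, t in zip(query, target):
        if q.upper() == 'X' and not (skipgaps and t == gap):
            out.append('X')    # masked in query, so mask in target
        else:
            out.append(t)      # unmasked position, or a gap that is skipped
    return ''.join(out)
```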
SeqList.maxGlob(self,inverse=False)
    Removes sequences with too much predicted order.
SeqList.maxX(self)
    Removes sequences with too many Xs.
SeqList.memSaveAutoFilter(self,cmd_list=None)
    Performs automatic sequence filtering based on cmd_list.
    >> cmd_list:list of commandline options [self.cmd_list]
SeqList.memSaveNR(self)
    Unaligned sequence redundancy method for memsaver mode.
SeqList.muscleAln(self,outfile=None,mapseq=True,alncommands='')
    Uses muscle to align self and returns aligned sequence object.
    >> outfile:str = name for output file. [self.muscle] if None
    >> mapseq:boolean = whether to map sequences back on to (and return) self [True]
    >> alncommands:str = additional commandline commands
    << SeqList object
SeqList.nextFasSeq(self,fileobject=None,lastline=None,raw=False)
    Returns sequence object of next sequence from passed File Object. Returns None if end of file.
    If lastline=None, file MUST be FASTA format and ONE LINE PER SEQUENCE, else Fasta only.
    If lastline given, will return a tuple of (sequence object, nextline)
    Sequence object is also placed in self.seq.
    >> fileobject: File Object from which sequence to be read
    >> lastline:str = last line read from FILE object, typically the next description line
    >> raw:bool = Whether to return (name,sequence) instead of sequence object
    << sequence object or tuple of (sequence object, nextline)
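    The "read until the next description line" pattern that lastline supports can be sketched with a generator over a file handle. This is an illustration in Python 3, not the nextFasSeq code: iter_fasta is a hypothetical name and it yields (name, sequence) tuples, similar in spirit to raw=True.

```python
import io


def iter_fasta(handle):
    """Yield (name, sequence) tuples from an open FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)    # previous record is complete
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line.strip())        # sequence may span several lines
    if name is not None:
        yield name, ''.join(chunks)            # final record at end of file


if __name__ == "__main__":
    demo = io.StringIO(">seq1 example\nACDE\nFGH\n>seq2\nKLMN\n")
    for name, sequence in iter_fasta(demo):
        print(name, len(sequence))
```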
SeqList.nextUniProtSeq(self,fileobject=None,lastline=None)
    Returns sequence object of next sequence from passed File Object. Returns None if end of file.
    Will return a tuple of (sequence object, nextline)
    Sequence object is also placed in self.seq.
    >> fileobject: File Object from which sequence to be read
    >> lastline:str = last line read from FILE object, typically the next description line
    << tuple of (sequence object, nextline)
SeqList.numbersForNames(self,check1toN=False)
    Returns True if all names are purely numbers, else False.
    >> check1toN:boolean = whether to also check that the numbers are 1 to N [False]
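    A sketch of this check on a plain list of name strings (the real method works on the loaded Sequence objects; the function name here is hypothetical):

```python
def numbers_for_names(names, check1toN=False):
    """Return True if every name is purely numeric (optionally exactly 1..N)."""
    if not all(name.isdigit() for name in names):
        return False
    if check1toN:
        # The numbers must be exactly 1..N, in any order.
        return sorted(int(name) for name in names) == list(range(1, len(names) + 1))
    return True
```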
SeqList.printLog(self, id='#ERR', text='Log Text Missing!', timeout=True, screen=True, log=True, newline=True)
    Prints text to log with or without run time.
    >> id:str = identifier for type of information
    >> text:str = log text
    >> timeout:boolean = whether to print run time
    >> screen:boolean = whether to print to screen (v>=0)
    >> log:boolean = whether to print to log file
SeqList.pwAln(self,seq1,seq2,unlink=True,retry=5)
    Uses align to make a pairwise alignment of seq1 and seq2.
    >> seq1 & seq2:Sequence Objects
    >> unlink:Boolean = whether to delete tmp files.
    >> retry:int = Decrease with each retry: if retry = 0, give up!
    << PWAln object
SeqList.querySeq(self,query=None)
    Sets Query Sequence if appropriate.
    >> query:str [None] = Name to be used for query search (equiv. cmd query=X)
    << qry if query selected. None if not.
SeqList.reFormat(self,outfile=None,reformat=None,split=True)
    Saves sequences in appropriate format using self.attributes.
    >> outfile:str = Name of file  ['%s.*' % self.info['Basefile']]
    >> reformat:str = New format of file (fasta/phylip/scanseq/acclist/idlist/fastacmd/teiresias/6rf/3rf/est6rf)
    >> split:boolean = whether to use self.stat['Split'] [True]
SeqList.relConList(self,window=35,slimcalc=None)
    Returns list of relative conservation per residue, incorporating sequence weighting, using qry and seqs.
    >> window:int = number of residues either side to consider. If window=0, will return absolute values. 
SeqList.relCons(self,outfile=None)
    Calculates and outputs alignment relative conservation based on Shannon Entropy.
    >> outfile:str [None] = Name of file to use for output in place of self.info['RelCons']
SeqList.relWinCon(self,conslist,window=30,qry=None)
    Returns relative conservation (x-mean)/sd for +/- windows.
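    The (x-mean)/sd calculation over a +/- window can be sketched as below. Several details are assumptions rather than documented behaviour: the window is truncated at the ends of the list, the population standard deviation is used, positions with zero variance score 0.0, and the qry argument is ignored.

```python
import math


def rel_win_con(conslist, window=30):
    """Return (x - mean) / sd for each position, using +/- window neighbours."""
    rel = []
    for i, x in enumerate(conslist):
        win = conslist[max(0, i - window): i + window + 1]
        mean = sum(win) / len(win)
        sd = math.sqrt(sum((v - mean) ** 2 for v in win) / len(win))
        rel.append((x - mean) / sd if sd > 0 else 0.0)
    return rel
```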
SeqList.removeSeq(self,text='',seq=None,checkAln=False)
    Removes a sequence and logs the reason.
    >> text:str = Reason for removal
    >> seq:Sequence object
    >> checkAln:Boolean = whether to CheckAln (includes TidyGap) after seq removal
SeqList.saveAcc(self,seqs=[],accfile=None,scansite=False,uniprot=False,log=True)
    Saves accession numbers for UniProt retrieval or Scansite Parallel upload.
    >> seqs:list of Sequence Objects. [self.seq]
    >> accfile:str = Name of file ['%s.acc' % self.info['Basefile']]
    >> scansite:boolean = whether to append a database identifier [False]
    >> uniprot:boolean = whether to output UniProt AccNum only [False]
    >> log:boolean = whether to output report to log [True]
SeqList.saveFasta(self,seqs=[],seqfile=None,linelen=0,name='Name',namelen=0,append=False,id=False,log=True,case=None,screen=None)
    Saves sequences in SeqList object in fasta format
    >> seqs:list of Sequence Objects (if none, use self.seq)
    >> seqfile:str [self.info['Name'].fas] = filename
    >> linelen:int [0] = max seqline length [0 = all on one line]
    >> name:str = Type of name to use as sequence name: 'short'=shortName(), 'AccNum'=AccNum,
            'Teiresias'=Teiresias format, 'Number' = Number only
    >> namelen:int [0] = max length of sequence name [0 = no max]
    >> append:boolean [False] = append, do not overwrite, file
    >> id:boolean [False] = Appends sequence number to start of name.
    >> log:boolean [True] = Whether to log output
    >> case:boolean [False] = Whether to use self.dict['Case'] to set output case
    >> screen:bool [None] = Whether to print log output to screen (None will use log setting)
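    The effect of linelen on fasta output can be illustrated with the sketch below, which builds the text for (name, sequence) tuples. The function fasta_text is hypothetical and ignores the name, namelen, id, append and case options.

```python
def fasta_text(records, linelen=0):
    """Return FASTA-formatted text for a list of (name, sequence) tuples."""
    lines = []
    for name, seq in records:
        lines.append('>' + name)
        if linelen > 0:
            # Wrap the sequence at linelen characters per line.
            lines.extend(seq[i:i + linelen] for i in range(0, len(seq), linelen))
        else:
            lines.append(seq)    # linelen=0: whole sequence on one line
    return '\n'.join(lines) + '\n'
```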
SeqList.saveNexus(self,seqs=[],seqfile=None,name='short',id=False,log=True)
    Saves sequences in SeqList object in nexus format
    >> seqs:list of Sequence Objects (if none, use self.seq)
    >> seqfile:str [self.info['Name'].nex] = filename
    >> name:str = Type of name to use as sequence name: 'short'=shortName(), 'AccNum'=AccNum, 'num'=Number
    >> id:boolean [False] = Appends sequence number to start of name.
    >> log:boolean [True] = Whether to log output
SeqList.savePhylip(self,seqs=[],seqfile=None,name='num',id=False,log=True)
    Saves sequences in SeqList object in phylip format
    >> seqs:list of Sequence Objects (if none, use self.seq)
    >> seqfile:str [self.info['Name'].phy] = filename
    >> name:str = Type of name to use as sequence name: 'short'=shortName(), 'AccNum'=AccNum, 'num'=Number
        - Note that if any names are >10 characters long, Numbers will be used instead
    >> id:boolean [False] = Appends sequence number to start of name.
    >> log:boolean [True] = Whether to log output
SeqList.saveR(self,seqs=[],seqfile=None,name='Name',namelen=0,id=False,log=True,case=False)
    Saves sequences in SeqList object in TDT format for R conversion to PNG.
    >> seqs:list of Sequence Objects (if none, use self.seq)
    >> seqfile:str [self.info['Name'].fas] = filename
    >> name:str = Type of name to use as sequence name: 'short'=shortName(), 'AccNum'=AccNum, 'Number' = Number only
    >> namelen:int [0] = max length of sequence name [0 = no max]
    >> id:boolean [False] = Appends sequence number to start of name.
    >> log:boolean [True] = Whether to log output
    >> case:boolean [False] = Whether to use self.dict['Case'] to set output case
SeqList.saveScanSeq(self,seqs=[],seqfile=None)
    Saves sequences in format for Scansite Parallel upload.
    >> seqs:list of Sequence Objects. [self.seq]
    >> seqfile:str = Name of file ['%s.scanseq' % self.info['Basefile']]
SeqList.seqAlnPos(self,seq,alnpos,next=True)
    Returns corresponding position of sequence in alignment.
    >> seq:Sequence object
    >> alnpos:int = Position (0->L) in alignment
    >> next:bool [True] = Whether to return next position if gap (else returns -1)
    << returns position in sequence (0->L)
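The mapping from an alignment column to a sequence position can be sketched as below. This is a minimal standalone illustration, not the SeqList implementation: the function name aln_to_seq_pos and its plain-string input are assumptions made for the example, and it assumes '-' is the only gap character.

```python
def aln_to_seq_pos(aligned, alnpos, next_if_gap=True, gap='-'):
    """Map a 0-based alignment column to a 0-based ungapped sequence position.

    Hypothetical helper for illustration only; it works on a plain aligned
    string rather than on a Sequence object.
    """
    if aligned[alnpos] == gap and not next_if_gap:
        return -1
    # The number of residues before this column equals the index of the
    # residue at this column (or of the next residue, if the column is a gap).
    return len(aligned[:alnpos].replace(gap, ''))


print(aln_to_seq_pos('A--CG', 3))
print(aln_to_seq_pos('A--CG', 1))
print(aln_to_seq_pos('A--CG', 1, next_if_gap=False))
```

For a gap with no residue after it, this sketch returns the ungapped sequence length, which is one reading of the documented 0->L range.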
SeqList.seqFromFastaCmd(self,id,dbase=None)
    Returns sequence object of sequence as obtained with fastacmd.
    >> id:str = id of sequence to pass to fastacmd
    >> dbase:str = formatted database
SeqList.seqLen(self,seqkey='Sequence')
    Returns alignment length (only meaningful if sequences are aligned).
SeqList.seqNameDic(self,key='short',proglog=True)
    Returns a dictionary of seqName:Sequence object for self.seq.
    >> key:str = type of name to use as key:
        'short' = seq.shortName(), else uses seq.info[key]
        'NumName' = trim off leading 'X '
        'UniProt' = Original UniProt IDs.
        'Max' = return a dictionary that has shortNames, IDs and AccNums as keys!
    >> proglog:bool [True] = whether to print output to log
SeqList.seqNum(self)


SeqList.seqs(self,x=None)


SeqList.setupSubDict(self,masking=True,alphabet=[])
    Sets up and returns query substitution frequency dictionary.
    >> masking:bool [True] = whether to use qry.info['MaskSeq'] if found.
    >> alphabet:list [] = Alphabet used for dictionary keys. 
SeqList.sortByLen(self,longfirst=True,proglog=False)
    Sort sequences according to length.
SeqList.splitSeq(self,split=4000)
    Splits seqList into numbered files of X sequences.
    >> split:int = Number of sequences per file
    << returns list of output files
SeqList.stripGap(self,stripgap=0,codons=False,backup='',gaps=['-'],seqkey='Sequence')
    Removes columns containing gaps from the alignment.
    >> stripgap:num [0] = Number of sequences with gaps before stripping from alignment. Proportion if < 1.
    >> codons:bool [False] = Whether to treat alignment using a codon model (i.e. strip sets of three bases)
    >> backup:str [''] = Backup full-length sequences in seq.info[backup]
    >> gaps:list ['-'] = Characters to recognise as gaps (e.g. could add 'X' for tidyXGap equivalent)
    >> seqkey:str ['Sequence'] = seq.info key to use to check alignment
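Column-wise gap stripping of this kind can be sketched as follows. This is an assumption-laden illustration rather than the library code: the name strip_gap_columns is hypothetical, the input is a list of equal-length strings, and a column is dropped once the number of gapped sequences reaches the threshold (values below 1 are read as a proportion). How stripGap() itself treats stripgap=0 and the exact comparison used are not stated above, so this sketch expects a positive threshold.

```python
def strip_gap_columns(aligned, threshold=1, gaps='-'):
    """Drop alignment columns in which `threshold` or more sequences are gapped.

    Hypothetical helper: `aligned` is a list of equal-length strings and
    `threshold` must be positive; values below 1 are treated as a proportion.
    """
    nseq = len(aligned)
    cutoff = threshold * nseq if threshold < 1 else threshold
    keep = []
    for col in range(len(aligned[0])):
        gapped = sum(1 for seq in aligned if seq[col] in gaps)
        if gapped < cutoff:
            keep.append(col)
    return [''.join(seq[c] for c in keep) for seq in aligned]


alignment = ['AC-GT', 'A--GT', 'ACTG-']
print(strip_gap_columns(alignment, threshold=1))
print(strip_gap_columns(alignment, threshold=0.5))
```

Passing gaps='-X' would also treat X as a gap, in the spirit of the tidyXGap equivalent mentioned above.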
SeqList.tidyGaps(self,key='Sequence',seqs=[],backup='')
    Removes 100% gap columns from the alignment.
    >> key:str ['Sequence'] = seq.info Key to tidy. (May be MaskSeq etc.)
    >> seqs:list [] = List of seqs used to judge gappiness. e.g. Could be just Query. If empty, will use all.
    >> backup:str [''] = If given, will copy full length info to seq.info[backup]
SeqList.tidyQueryGaps(self,key='Sequence',backup='')


SeqList.tidyXGaps(self,key='Sequence')
    Removes 100% gap/"X" columns from the alignment.
SeqList.truncSeq(self,trunc=0)
    Truncates sequences to the given length.
    >> trunc:int = Length of truncated sequences. (-ve values will return last X AAs)
SeqList.units(self)


SeqList.winChop(self,seq,win,slide)
    Chops a sequence into small sequence windows and returns a new SeqList object
    >> seq:str = sequence
    >> win:int = window size
    >> slide:int = steps to slide window by
    << chopped:SeqList = sequence list of fragments
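A sliding-window chop can be sketched on a plain string as below. The helper win_chop is hypothetical and returns a list of fragments rather than a new SeqList object; it also assumes that a trailing partial window is discarded, which may differ from what winChop() does.

```python
def win_chop(sequence, win, slide):
    """Return windows of length `win`, starting every `slide` residues.

    Hypothetical helper: trailing residues that do not fill a whole window
    are dropped, and a sequence shorter than `win` is returned whole.
    """
    last_start = max(len(sequence) - win, 0)
    return [sequence[i:i + win] for i in range(0, last_start + 1, slide)]


print(win_chop('ABCDEFGH', win=4, slide=2))
```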

Sequence Class

    Individual Sequence Class. Author: Rich Edwards (2005).
    See rje_sequence.py for details.

rje_seq Module Methods

rje_seq.Blast2Fas(seqlist)
    Will blast sequences against list of databases and compile a fasta file of results per query.
rje_seq.DBSize(callobj,filename,seqsize=False,nonx=False)
    Zips through file and counts number of amino acids/nucleotides.
    >> callobj=Calling Object
    >> filename=Name of Sequence File
    << returns number of amino acids/nucleotides.
rje_seq.MWt(sequence='')
    Returns Molecular Weight of Sequence.
rje_seq.SeqCount(callobj,filename)
    Zips through file and counts number of sequences.
    >> callobj=Calling Object
    >> filename=Name of Sequence File
    << returns number of sequences
rje_seq.SeqInfoListFromFile(callobj,filename,key='short',startfrom=None)
    Zips through file and returns a list of sequence info (see seqInfoList()).
    >> callobj=Calling Object
    >> filename=Name of Sequence File
    << returns list of sequence info
rje_seq.deGap(seqstring)
    Degaps sequence.
rje_seq.eisenbergHydropathy(sequence='',returnlist=True)
    Returns the Eisenberg Hydropathy for the sequence.
    >> sequence:str = AA sequence
    >> returnlist:boolean [True] = Returns list of hydropathies rather than total
    << hyd:float = Eisenberg Hydropathy for the sequence
rje_seq.pamDis(seqlist,pam)
    Makes an all-by-all PAM Distance Matrix.
    - seqin=FILE    : Sequence file (aligned)
    >> seqlist:SeqList Object
    >> pam:rje_pam.PamCtrl Object
rje_seq.pwIDGapExtra(seq1,seq2,nomatch=['X'])
    Returns a dictionary of stats: 'ID','Gaps','Extra', 'Len'. Each stat is a two-element list of [seq1Vseq2,seq2Vseq1].
    Numbers returned are absolute counts and should be divided by 'Len' stat for %.
    >> seq1 & seq2 = strings to be compared.
    >> nomatch = list of characters not to count as ID
rje_seq.rna2dna(seqs=[])
    Converts sequences in list from RNA to DNA (U -> T).
    >> seqs:list of Sequence Objects
rje_seq.runMain()


rje_seq.seqInfoList(seqs=[],key='short')
    Returns list of sequence info with key or 'short' for shortName()
rje_seq.surfaceAccessibility(sequence='',returnlist=True)
    Returns a list of Janin et al SA values for each residue. Based on method of Norman Davey.
    >> sequence:str = input sequence
    >> returnlist:boolean [True] = Returns list of SA values rather than total
    << sa_prob:list of SA values for each residue

rje_seq Module ToDo Wishlist

    # [y] Read in sequences from Fasta/Phylip/ClustalW format
    # [y] Output to Fasta Format
    # [y] Will check (for) alignment [checkAln()]
    # [y] Chop sequence up into fragments [winChop()]
    # [x] Interactive menus for module and major activities
    # [ ] Tidy and __doc__ all methods
    # [ ] Profiles
    # [y] Distance Matrices
    # [y] SwissProt Download Format input
    # [y] Phylip Format 'phy' output
    # [ ] Clustalw Align format 'aln' output
    # [ ] MapNames methods - either from order or beginnings of names
    # [y] Redundant Accession number filter
    # [ ] Species from speclist.txt (grep option)
    # [ ] SplitByList method - splits sequences according to those present/absent in another SeqList
    # [ ] SplitByDBase method - like above but divides according to source database (combine in splitBy(split=None))
    # [y] Making of PAM all by all distance matrix
    # [y] Make a split=X command - splits file up into chunks of X [4000 for TMHMM] sequences
    # [ ] Add winchop command and output
    # [y] Add a seqout command - saves results of manipulations to file when called directly from command line
    # [y] Add an extra method to remove 100% gapped columns from alignment 'tidygap'
    # [ ] Allow different cases in sequences?
    # [Y] Remove filterout
    # [Y] Allow MemSaver filtering and sequence manipulation
    # [Y] Move dismatrix into different module
    # [Y] reformat=X option
    # [Y] Add good- and badacc, badspec & badseq instead of filter...
    # [ ] Replace ALIGN with GABLAM for most major applications, such as redundancy filtering etc.
    # [ ] Think about moving methods to an rje_seq_filter methods module?
    # [ ] Incorporate rje_hmm
    # [ ] Make a user menu
    # [ ] Read in alignment etc but extract sequence details from UniProt file?
    # [Y] Extend reformat=X => fasta/phylip/scanseq/acclist/fastacmd/teiresias
    # [Y] Add nextUniProtSeq, like nextFasSeq, for memsaver=T: check format in memSaverReFormat method
    # [ ] Add more details of functionality in docstring
    # [ ] Add masking as in SLiMCore
    # [ ] Tidy this ToDo list!
    # [ ] Add in extra alignment commands as commandline option
    # [ ] Replace self.seq with self.list['Sequence']

rje_sequence [version 2.0] DNA/Protein sequence object ~ [Top]

Module: rje_sequence
Description: DNA/Protein sequence object
Version: 2.0
Last Edit: 05/03/11
Imports: rje, rje_disorder
Imported By: aphid, budapest, fiesta, gfessa, happi, presto_V5, qslimfinder, seqmapper, slimfinder, slimsearch, prodigis, rje_seqgen, rje_seqplot, rje_yeast, slim_pickings, rje_codons, rje_embl, rje_genbank, rje_motif_stats, rje_motiflist, rje_motifocc, rje_omim, rje_seq, rje_seqlist, rje_slimcalc, rje_slimcore, rje_slimhtml, rje_uniprot, slimjim
Copyright © 2006 Richard J. Edwards - See source code for GNU License Notice

Function:
This module contains the Sequence Object used to store sequence data for all PEAT applications that use DNA or
protein sequences. It has no standalone functionality.

This module contains all the methods for parsing out sequence information, including species and source database,
based on the format of the input sequences. If using a consistent but custom format for fasta description lines,
please contact me and I can add it to the list of formats currently recognised.

Uses general modules: copy, os, random, re, sre_constants, string, sys, time
Uses RJE modules: rje, rje_disorder

rje_sequence Module Version History

    # 1.0 - Separated Sequence object from rje_seq.py
    # 1.1 - Rudimentary opt['GeneSpAcc'] added
    # 1.2 - Modified RegExp for sequence detail extraction
    # 1.3 - Added list of secondary accession numbers and hasID() method to check ID and all AccNum (and gnspacc combos)
    # 1.4 - Added Peptide Design methods
    # 1.5 - Added storing of case in a dictionary self.dict['Case'] = {'Upper':[(start,stop)],'Lower':[(start,stop)]}
    # 1.6 - Added disorder and case masking
    # 1.7 - Added FudgeFactor and AA codes
    # 1.8 - Added position-specific AA masking
    # 1.9 - Added EST translation functions. Fixed fudging. Added dna() method.
    # 1.10- Fixed sequence name bug
    # 1.11- Added recognition of UniRef
    # 1.12- Added AA masking
    # 1.13- Added Taxonomy list and UniProt dictionary for UniProt sourced sequences (primarily).
    # 1.14- Added maskRegion()
    # 1.15- Added disorder proportion calculations.
    # 1.16- Added additional Genbank and EnsEMBL BioMart sequence header recognition.
    # 1.17- Added nematode sequence conversion.
    # 2.0 - Replaced RJE_Object with RJE_ObjectLite.

Sequence Class

    Individual Sequence Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Name of sequence
    - Type = Type of sequences (RNA/DNA/Protein)
    - Sequence = Actual sequence
    - Description = Description of sequence (if desired)
    - ID = Sequence identifier (unique 'word')
    - AccNum = Accession Number
    - DBase = Source database (for AccNum)
    - Gene = Gene Symbol
    - Species = species name
    - SpecCode = SwissProt Species Code
    - Format = Name Format
    - NCBI = NCBI GenBank gi number
    
    Stat:numeric

    Opt:boolean
    - RevComp = Whether sequence has been reverse complemented or not [False]

    List:list
    - IsDisordered = List of True/False values for each residue, whether disordered or not
    - Secondary ID = List of secondary IDs and Accession numbers
    - Taxonomy = List of taxonomic levels for source species
    
    Dict:dictionary
    - Case = Stores case of original input sequence as tuples {'Upper':[(start,stop)],'Lower':[(start,stop)]}
    - UniDAT = UniProt data dictionary (if sequence read from UniProt entry)

    Obj:RJE_Objects
    - Disorder = rje_disorder.Disorder object containing disorder prediction results. (Must run self.disorder() first!)
Sequence.MWt(self)


Sequence._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Sequence._setAttributes(self)
    Sets Attributes of Object.
Sequence.aaFreq(self,aafreq={},newkeys=True)
    Adds to aafreq dictionary (if given) and returns.
    >> aafreq:dictionary of {aa:freq(count)}
    >> newkeys:boolean [True] = whether to add new AA keys if missing from aafreq
    << aafreq:new dictionary of values
Sequence.aaLen(self)


Sequence.aaNum(self)


Sequence.addSequence(self,sequence='',case=True,caselist=[],stripnum=False)
    Adds sequence to object, extracting case information.
    >> sequence:str [''] = sequence to add to self.info['Sequence']
    >> case:bool [True] = whether to store case information in self.dict['Case']
    >> caselist:list [] = list of case positions to over-ride read in case (self.list['Case'] from rje_seq)
    >> stripnum:boolean [False] = whether to strip numbers from sequences
Sequence.deGap(self)
    Degaps sequence.
Sequence.disorder(self,returnlist=False,reset=False)
    Adds disorder object and predicts disorder. See rje_disorder.py for details.
Sequence.dna(self)


Sequence.extractDetails(self,gnspacc=False)
    Extracts ID, Acc# etc. from sequence name.
Sequence.fudgeFactor(self,posdict={},case=False,gaps=True,wildcards=True)
    Returns the necessary fudge factor to match posdict. Will raise error if it cannot do it!
    >> posdict:dict = {pos (1 to L):sequence}
    >> case:bool [False] = whether to use self.dict['Case'] information.
    >> gaps:bool [True] = whether to leave gaps in the sequence
    >> wildcards:bool [True] = whether to account for masked/wildcard Xs in posdict sequence
Sequence.gappedDisorder(self,gap='prev')
    Returns gapped disorder list.
Sequence.gene(self,newgene=None)


Sequence.getSequence(self,case=False,gaps=True)
    Returns sequence.
    >> case:bool [False] = whether to use self.dict['Case'] information.
    >> gaps:bool [True] = whether to leave gaps in the sequence
Sequence.globProportion(self,absolute=False)
    Returns proportion defined as globular.
Sequence.hasID(self,id=None,gnspacc=True,uniprot=True)
    Checks ID, AccNum and secondary ID for match.
    >> id:str [None] = desired ID.
    >> gnspacc:boolean [True] = whether to try generating GnSpAcc formats for testing.
    >> uniprot:boolean [True] = whether to try making "ID (Acc)" combos for testing.
    << Returns True if ID found or False if not.
Sequence.isDisordered(self,pos=-1,reset=False)
    Returns whether a particular residue is disordered.
    >> pos:int [-1] = position of residue (0->(L-1)). If < 0 will return True/False for status only.
    >> reset:bool [False] = Reset disorder list before assessing.
Sequence.maskAA(self,maskaa,mask='X',log=True)
    Masks the given amino acids in the sequence.
    >> maskaa:list of AAs to be masked
    >> mask:str ['X'] = character to use for masking
    >> log:bool [True] = whether to print masking to log
    << returns number of AAs masked
Sequence.maskCase(self,case='lower',mask='X',log=True)
    Masks lower or upper case regions of the sequence and returns the sequence.
    >> case:str ['lower'] = whether to mask lower or upper case regions.
    >> mask:str ['X'] = character to replace sequence with
    >> log:bool [True] = whether to log the amount of masking
Sequence.maskDisorder(self,inverse=False,mask='X',log=True)
    Adds disorder object with prediction and masks disorder.
    >> inverse:bool [False] = Masks out predicted ordered regions (i.e. keeps disorder only)
    >> mask:str ['X'] = character to use for masking
    >> log:bool [True] = whether to print masking to log
Sequence.maskLowComplexity(self,lowfreq=5,winsize=10,mask='X',log=True)
    Masks low complexity regions of sequence.
    >> lowfreq:int = Number of same aas in window size to mask
    >> winsize:int = Size of window to consider
    >> mask:str ['X'] = character to use for masking
    >> log:bool [True] = whether to print masking to log
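One way to read the lowfreq/winsize rule is sketched below. This is an assumption, not the method's actual algorithm: the helper mask_low_complexity is hypothetical, and it masks every copy of a residue that occurs at least lowfreq times within any window of winsize residues. Which positions maskLowComplexity() really masks is not specified above.

```python
def mask_low_complexity(sequence, lowfreq=5, winsize=10, mask='X'):
    """Mask residues that occur `lowfreq` or more times within a window.

    Hypothetical helper: counts are taken from the original sequence, so
    earlier masking never changes the assessment of later windows.
    """
    masked = list(sequence)
    for start in range(max(len(sequence) - winsize, 0) + 1):
        window = sequence[start:start + winsize]
        for aa in set(window):
            if aa != mask and window.count(aa) >= lowfreq:
                for offset, res in enumerate(window):
                    if res == aa:
                        masked[start + offset] = mask
    return ''.join(masked)


print(mask_low_complexity('MKQQQQQQLTA'))
```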
Sequence.maskPosAA(self,maskdict={},mask='X',log=True)
    Masks position-specific amino acids. Returns and updates sequence.
    >> maskdict:dictionary of {pos:AAs} where pos is 1->L and AAs is a string of the AAs to mask
    >> mask:str ['X'] = character to replace sequence with
    >> log:bool [True] = whether to log the amount of masking
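The maskdict format can be illustrated with a short standalone sketch. The helper mask_pos_aa is hypothetical and operates on a plain string; positions are 1 to L as documented, and positions outside the sequence are silently ignored here, which is an assumption.

```python
def mask_pos_aa(sequence, maskdict, mask='X'):
    """Mask residue at 1-based position `pos` if it is one of maskdict[pos].

    Hypothetical helper for illustration; out-of-range positions are skipped.
    """
    residues = list(sequence)
    for pos, aas in maskdict.items():
        i = pos - 1  # convert 1->L numbering to a 0-based index
        if 0 <= i < len(residues) and residues[i] in aas:
            residues[i] = mask
    return ''.join(residues)


# Mask M at position 1, R or H at position 2, and Y or F at position 5.
print(mask_pos_aa('MKTAY', {1: 'M', 2: 'RH', 5: 'YF'}))
```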
Sequence.maskRegion(self,maskregion=[],inverse=False,mask='X',log=True)
    Masks region of protein.
    >> maskregion:list [] = Pairs of positions to (inclusively) mask
    >> inverse:bool [False] = Masks everything outside the given regions (i.e. keeps the listed regions only)
    >> mask:str ['X'] = character to use for masking
    >> log:bool [True] = whether to print masking to log
Sequence.name(self)


Sequence.newGene(self,gene='p',keepsp=False,gnspacc=True)
    Gives sequence new gene.
    >> gene:str = new gene
    >> keepsp:bool [False] = whether to keep an UPPER CASE gene identifier
Sequence.nonN(self)


Sequence.nonX(self)


Sequence.reverseComplement(self,rna=False)
    Converts to reverse complement of sequence (upper case).
Sequence.sameSpec(self,otherseq,unk=False)
    Returns true if same spec (or SpecCode) or False if not.
    >> otherseq:Sequence Object.
    >> unk:bool [False] = value to be returned if either sequence is of unknown species.
Sequence.seqLen(self)


Sequence.seqType(self)
    Returns (and possibly guesses) Sequence Type
    - Protein if non-ATGCUN
    - DNA if ATGCN only
    - RNA if AUGCN only
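The three rules above can be sketched with set comparisons. The helper guess_seq_type is hypothetical; it ignores '-' gaps and checks DNA first, so a sequence with neither T nor U is reported as DNA. Both of those choices are assumptions, as is reporting a sequence containing both T and U as Protein.

```python
def guess_seq_type(sequence):
    """Guess DNA/RNA/Protein from the letters present in a sequence string.

    Hypothetical helper: gaps are ignored and DNA is tested before RNA.
    """
    letters = set(sequence.upper()) - set('-')
    if letters <= set('ATGCN'):
        return 'DNA'
    if letters <= set('AUGCN'):
        return 'RNA'
    return 'Protein'


for seq in ('ATGCCN', 'AUGC', 'MKTAY'):
    print(seq, guess_seq_type(seq))
```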
Sequence.shortName(self)
    Returns short name = first word of name.
Sequence.sixFrameTranslation(self)
    Translates DNA in all six reading frames into 'Translation' dictionary.
Sequence.specialDetails(self)
    Access special information from Names.
Sequence.trimPolyA(self)
    Removes 3' As.
Sequence.unit(self)




rje_sequence Module Methods

rje_sequence.MWt(sequence='')
    Returns Molecular Weight of Sequence.
rje_sequence.aaFreq(sequence,aafreq={},newkeys=True)
    Adds to aafreq dictionary (if given) and returns.
    >> sequence:str = Sequence for AAFreq calculation
    >> aafreq:dictionary of {aa:freq(count)}
    >> newkeys:boolean [True] = whether to add new AA keys if missing from aafreq
    << aafreq:new dictionary of values
rje_sequence.bestORF(protseq,startm=False,nonx=True)
    Returns longest ORF (or first of equal length).
    >> protseq:str = Protein sequence, i.e. translation of DNA/RNA.
    >> startm:bool [False] = Whether the ORF should start with a methionine
    >> nonx:bool [True] = Whether ORFs should only be assessed in terms of their non-X content.
rje_sequence.caseDict(sequence)
    Return dictionary of {'Upper':[(start,end)],'Lower':[(start,end)]}.
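The shape of this dictionary can be illustrated with a small sketch. The helper case_dict is hypothetical, and it assumes 0-based inclusive (start,end) coordinates and that non-letter characters such as gaps count as upper case; the coordinate convention used by caseDict() is not stated above.

```python
from itertools import groupby


def case_dict(sequence):
    """Return {'Upper': [(start, end)], 'Lower': [(start, end)]} for a string.

    Hypothetical helper: coordinates are 0-based and inclusive (assumption).
    """
    case = {'Upper': [], 'Lower': []}
    pos = 0
    for is_lower, run in groupby(sequence, key=str.islower):
        length = len(list(run))
        case['Lower' if is_lower else 'Upper'].append((pos, pos + length - 1))
        pos += length
    return case


print(case_dict('MKTaylQR'))
```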
rje_sequence.chargeDict(sequence,callobj=None)
    Performs absolute, net and charge balance calculations.
    >> sequence:str = sequence to calculate stats on.
    >> callobj:Object = calling object for error messages etc.
    << dictionary of {Type:Value}
rje_sequence.codons(sequence,codonfreq={},newkeys=True,rna=True,code_only=True)
    Adds to codonfreq dictionary (if given) and returns.
    >> sequence:str = Sequence for codon frequency calculation
    >> codonfreq:dictionary of {codon:freq(count)}
    >> newkeys:boolean [True] = whether to add new keys if missing from codonfreq
    >> rna:bool [True] = whether to convert to RNA (True) or DNA (False)
    >> code_only:bool [True] = whether to only allow codons in (RNA!) Genetic Code (True)
    << codonfreq:new dictionary of values
rje_sequence.dna2prot(dnaseq,case=False)
    Returns a protein sequence for a given DNA sequence.
rje_sequence.eisenbergHydropathy(sequence,returnlist=False)
    Returns the Eisenberg Hydropathy for the sequence.
    >> sequence:str = AA sequence
    >> returnlist:boolean [False] = Returns list of hydropathies rather than total
    << hyd:float = Eisenberg Hydropathy for the sequence
rje_sequence.estTranslation(dnaseq,minpoly=10,fwdonly=False)
    Returns translation of EST into protein reading frames.
    >> dnaseq:str = DNA sequence to translate
    >> minpoly:int [10] = Min. length of poly-A or poly-T to be recognised and removed.
rje_sequence.estTrunc(dnaseq,minpoly=10,fwdonly=False)
    Returns translation of EST into protein reading frames.
    >> dnaseq:str = DNA sequence to translate
    >> minpoly:int [10] = Min. length of poly-A or poly-T to be recognised and removed.
rje_sequence.extractNameDetails(name,callobj=None)
    Extracts details from sequence name and returns as dictionary.
rje_sequence.getSpecCode(species)
    Returns spec_code for given species. This should be moved later and expanded to allow to read from speclist.
rje_sequence.mapGaps(inseq,gapseq,callobj=None)
    Returns inseq with gaps inserted as found in gapseq.
rje_sequence.peptideDetails(sequence,callobj=None)
    Returns OK if sequence alright, or warning if bad aa combos.
    >> sequence:str = sequence to calculate stats on.
    >> callobj:Object = calling object for error messages etc.
    << string of peptide assessment
rje_sequence.reverseComplement(dnaseq,rna=False)
    Returns the reverse complement of the DNA sequence given.
rje_sequence.sixFrameTranslation(dnaseq)
    Translates DNA in all six reading frames into dictionary.
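Six-frame translation can be sketched with the standard genetic code as below. This is a standalone illustration, not the module's code: the names translate, revcomp and six_frames are hypothetical, the dictionary keys 1,2,3,-1,-2,-3 are an assumed convention, and incomplete trailing codons are dropped while unrecognised codons become X.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = 'TCAG'
AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODE = dict(zip((''.join(c) for c in product(BASES, repeat=3)), AMINO))


def translate(dna):
    """Translate from the first base; incomplete trailing codons are dropped."""
    dna = dna.upper().replace('U', 'T')
    return ''.join(CODE.get(dna[i:i + 3], 'X') for i in range(0, len(dna) - 2, 3))


def revcomp(dna):
    """Reverse complement; unrecognised bases become N."""
    comp = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return ''.join(comp.get(base, 'N') for base in reversed(dna.upper()))


def six_frames(dna):
    """Return {frame: translation} for frames 1, 2, 3 and -1, -2, -3."""
    frames = {}
    reverse = revcomp(dna)
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


print(six_frames('ATGGCCTAA'))
```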
rje_sequence.specCodeFromName(name)
    Returns the species code from a sequence name. (Stripped down version of Sequence.extractDetails())
    >> name:str = Sequence name
    << spcode:str = Species code
rje_sequence.threeFrameTranslation(dnaseq,minpoly=0)
    Translates DNA in all three forward reading frames into dictionary.
rje_sequence.trypDigest(sequence='')
    Returns trypsin digestion of sequence.
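An in-silico trypsin digest can be sketched as follows. The cleavage rule used here (cut after K or R unless the next residue is P) is the commonly quoted textbook rule and is an assumption; the rule applied by trypDigest() is not described above. The helper name tryp_digest is hypothetical.

```python
def tryp_digest(sequence):
    """Split a protein after K or R, except where the next residue is P.

    Hypothetical helper using the textbook trypsin rule (assumption).
    """
    peptides = []
    start = 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1:i + 2]  # empty string at the C-terminus
        if aa in 'KR' and next_aa != 'P':
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


print(tryp_digest('MKTAYRPGKLR'))
```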

rje_sequence Module ToDo Wishlist

    # [ ] : Add descriptions of recognised formats to documentation.
    # [ ] : Make more efficient sequence data extraction for larger sequence datasets.

RJE_SLiM [version 1.4] Short Linear Motif class module ~ [Top]

Module: RJE_SLiM
Description: Short Linear Motif class module
Version: 1.4
Last Edit: 21/09/11
Imports: rje, rje_scoring, rje_zen
Imported By: comparimotif_V3, happi, qslimfinder, slimfinder, slimmaker, slimsearch, RankByDistribution, bob, ned_rankbydistribution, rje_slimcalc, rje_slimcore, rje_slimhtml, rje_slimlist, slimfrap, slimjim
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module contains the new SLiM class, which replaces the old Motif class, for use with both SLiMFinder and
SLiMSearch. In addition, this module encodes some general motif methods. Note that the new methods are not
designed with Mass Spec data in mind and so some of the more complicated regexp designations for unknown amino acid
order etc. have been dropped. Because the SLiM class explicitly deals with *short* linear motifs, wildcard gaps are
capped at a max length of 9.

The basic SLiM class stores its pattern in several forms:
- info['Sequence'] stores the original pattern given to the Motif object
- info['Slim'] stores the pattern as a SLiMFinder-style string of defined elements and wildcard spacers
- dict['MM'] stores lists of Slim strings for each number of mismatches with flexible lengths enumerated. This is
used for actual searches in SLiMSearch.
- dict['Search'] stores the actual regular expression variants used for searching, which has a separate entry for
each length variant - otherwise Python RegExp gets confused! Keys for this dictionary relate to the number of
mismatches allowed in each variant and match dict['MM'].
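The idea of keeping one regular expression per length variant can be sketched as below. This is an illustration only: the helper length_variants is hypothetical, it assumes flexible wildcards are written as .{m,n}, and it ignores mismatches, so it does not reproduce the keys or contents of dict['Search'].

```python
import itertools
import re


def length_variants(pattern):
    """Expand each .{m,n} wildcard into every fixed-length alternative.

    Hypothetical helper: returns a list of patterns with plain '.' runs.
    """
    # Splitting on a regex with two groups yields [text, m, n, text, m, n, ...].
    pieces = re.split(r'\.\{(\d+),(\d+)\}', pattern)
    fixed = pieces[0::3]
    ranges = [range(int(low), int(high) + 1)
              for low, high in zip(pieces[1::3], pieces[2::3])]
    variants = []
    for lengths in itertools.product(*ranges):
        variant = fixed[0]
        for n, text in zip(lengths, fixed[1:]):
            variant += '.' * n + text
        variants.append(variant)
    return variants


for variant in length_variants('K.{1,2}P'):
    hits = [m.start() + 1 for m in re.finditer(variant, 'AKAPKAAP')]
    print(variant, hits)  # positions reported as 1 to L
```

Searching each fixed-length variant separately makes it unambiguous which variant produced a given hit.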

The following were previously used by the Motif class and may be revived for the new SLiM class if needed:
- list['Variants'] stores simple strings of all the basic variants - length and ambiguity - for identifying the
"best" variant for any given match

The SLiM class is designed for use with the SLiMList class. When a SLiM is added to a SLiMList object, the
SLiM.format() command is called, which generates the 'Slim' string. After this - assuming it is to be kept -
SLiM.makeVariants() makes the 'Variants' list. If creating a motif object in another module, these methods should be
called before any sequence searching is performed. If mismatches are being used, the SLiM.misMatch() method must
also be called.

SLiM occurrences are stored in the dict['Occ'] attribute. The keys for this are Sequence objects and values are
either a simple list of positions (1 to L) or a dictionary of attributes with positions as keys.

Commandline:
These options should be listed in the docstring of the module using the motif class:
- * alphabet=LIST : List of letters in alphabet of interest [AAs]
- * ambcut=X : Cut-off for max number of choices in ambiguous position to be shown as variant (0=All) [10]
- * trimx=T/F : Trims Xs from the ends of a motif [False]
- * dna=T/F : Whether motifs should be considered as DNA motifs [False]

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje
Other modules needed: None

rje_slim Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Initial working version.
    # 1.1 - Added DNA option.
    # 1.2 - Added "N of M or B" format options.
    # 1.3 - Fixed the terminal variant bug.
    # 1.4 - Added makeSLiM method for converting a list of instances into a regexp

SLiM Class

    Short Linear Motif class. Author: Rich Edwards (2007). Based on old rje_motif_V3.Motif object.

    Info:str
    - Name = Name of motif
    - Description = Description of motif
    - Sequence = *Original* pattern given to motif
    - Slim = Reformatted sequence
    
    Opt:boolean
    - DNA = Whether motifs should be considered as DNA motifs [False]
    - TrimX = Trims Xs from the ends of a motif

    Stat:numeric
    - AmbCut = Cut-off for max number of choices in ambiguous position to be shown as variant [10]
        For mismatches, this is the max number of choices for an ambiguity to be replaced with a mismatch wildcard
    - IC = Information Content of motif

    List:list
    - Alphabet = List of letters in alphabet of interest [AAs]

    Dict:dictionary    
    - MM = stores lists of Slim strings for each number of mismatches. This is used for actual searches in SLiMSearch.
    - Occ = Dictionary of occurrences {Sequence:[Pos/{Hit stats}]} (Cannot use pos as key due to variable wildcards)
    - Search = stores the actual regular expression variants used for searching, which has a separate entry for
      each length variant - otherwise Python RegExp gets confused! Keys for this dictionary relate to the number of
      mismatches allowed in each variant and match dict['MM'].
      
    Obj:RJE_Objects
    - SLiMList = Parent SLiMList object
SLiM._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SLiM._setAttributes(self)
    Sets Attributes of Object.
SLiM.ambCutPattern(self)


SLiM.format(self,reverse=False)
    Formats motif to generate self.info['Slim'].
    >> reverse:boolean [False] = whether to reverse sequence.
    << returns True if successful, or False if reformatting fails.
SLiM.misMatch(self,mismatch={})
    Makes self.dict['MM'] using mismatch dictionary.
    >> mismatch:dictionary of {mm 'X':Y aa}
SLiM.occFilter(self,occfilter)
    Filters occurrences according to given OccFilter dictionary.
SLiM.occList(self)
    Returns simple list of occurrences.
SLiM.occNum(self)
    Returns total number of occurrences.
SLiM.pattern(self)


SLiM.searchDict(self)
    Makes 'Search' dictionary from mismatch dictionary.
SLiM.searchSequence(self,seq=None,sequence='',logtext='')
    Searches the given sequence for occurrences of self and returns a list of hit dictionaries: Pos,Variant,Match
    >> seq:Sequence object [None] = add hits to occurrence dictionary with this as key
    >> sequence:str = sequence to be searched if no sequence object given
    >> logtext:str [''] = text to precede progress printing. If '', no progress printing!
    << hitlist:list of dictionaries with hit information: Pos,Variant,Match,ID,MisMatch
SLiM.seqNum(self)


SLiM.seqs(self)


SLiM.slim(self,wings=False)


SLiM.slimFix(self)


SLiM.slimLen(self)


SLiM.slimMinLen(self)


SLiM.slimPos(self)


SLiM.variant(self,Occ)
    Returns actual variant used for match in Occ.
SLiM.weightedIC(self,aafreq)
    Replaces IC with weighted IC given aafreq, and returns.
    >> aafreq:dict = AA frequency dictionary
SLiM.wildScram(self)
    Performs wildcard scrambling.

rje_slim Module Methods

rje_slim.ambigPos(ambig,defaults)
    Returns formatted ambiguities.
rje_slim.ambigPrint(ambig,defaults)
    Returns formatted ambiguities.
rje_slim.checkSlimFormat(sequence)
    Checks whether given sequence is in correct "Slim" format.
rje_slim.convertDNA(slimcode)
    Splits, converts to DNA and recombines.
rje_slim.elementIC(element='')
    Calculates the IC for a given pattern element. This is much simplified from the rje_Motif method as (a) amino acid
    frequencies are never used, and (b) only wildcards can have flexible lengths, which does not affect this score.
    >> element:str = part of motif pattern
    << returns calculated IC for element
rje_slim.expectString(_expect)
    Returns formatted string for _expect value.
rje_slim.makeSlim(peptides,minseq=3,minfreq=0.75,maxaa=5,callobj=None,ignore='X-')
    Generates a regexp SLiM from a peptide list (no variable wildcards).
    >> peptides:list of str = List of peptide sequences (aligned, no gaps)
    >> minseq:int [3] = Min. no. of sequences for an aa to be in
    >> minfreq:num [0.75] = Min. combined freq of accepted aa to avoid wildcard
    >> maxaa:int [5] = Max. no. different amino acids for one position
    >> ignore:str ['X-'] = Amino acid(s) to ignore. (If nucleotide, would be N)
rje_slim.patternFromCode(slim,ambcut=20,dna=False)
    Returns pattern with wildcard for iXj formatted SLiM (e.g. A-3-T-0-G becomes A...TG).
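The conversion in that example can be sketched for the simplest case. The helper pattern_from_code is hypothetical and handles only fixed-length wildcard spacers; rendering a multi-letter element as a bracketed ambiguity (e.g. KR as [KR]) is an assumption about the slimcode format, and the ambcut and dna options are not modelled.

```python
def pattern_from_code(slimcode):
    """Convert an element-spacer-element code such as 'A-3-T-0-G' to a pattern.

    Hypothetical helper: even-indexed parts are defined positions and
    odd-indexed parts are fixed wildcard spacer lengths.
    """
    pattern = ''
    for i, part in enumerate(slimcode.split('-')):
        if i % 2:
            pattern += '.' * int(part)   # wildcard spacer of fixed length
        elif len(part) > 1:
            pattern += '[%s]' % part     # ambiguity (assumed representation)
        else:
            pattern += part
    return pattern


print(pattern_from_code('A-3-T-0-G'))
```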
rje_slim.prestoFromCode(slim)
    Returns old PRESTO list from new SLiM code. (Cannot have variable wildcard - will use longer.)
rje_slim.slimFix(slim)


rje_slim.slimFixFromCode(slim)
    Returns the number of fixed positions in a slim.
rje_slim.slimFromPattern(inseq,reverse=False,trimx=False,motif=None,dna=False)
    Formats motif to generate self.info['Slim'].
    >> inseq:str = Sequence to reformat. Can be slimcode or pattern (inc. PROSITE)
    >> reverse:boolean [False] = whether to reverse sequence.
    >> trimx:boolean [False] = whether to remove leading and trailing wildcards
    >> dna:boolean [False] = whether motif is a DNA motif
    << returns slim code.
rje_slim.slimLen(slim)


rje_slim.slimLenFromCode(slim)
    Returns maximum length of SLiM in slimcode format.
rje_slim.slimMinLenFromCode(slim)
    Returns minimum length of SLiM in slimcode format.
rje_slim.slimPos(slim)


rje_slim.slimPosFromCode(slim)
    Returns the number of positions in a slim.
rje_slim.stripMotifBrackets(pattern)
    Strips unnecessary brackets from motifs.
rje_slim.test(motif,cmd_list=[],mm={})
    Temp test method.
rje_slim.weightedElementIC(element,aafreq)
    Calculates the IC for a given pattern element, weighted by amino acid frequencies. Only wildcards can
    have flexible lengths, which does not affect this score.
    >> element:str = part of motif pattern
    >> aafreq:dict = dictionary of amino acid (or DNA) frequencies: length of dictionary defines no. of possibilities
    << returns calculated IC for element

rje_slim Module ToDo Wishlist

    # [Y] : Add new %x either/or regular expression. E.g. %1K..P..P..%1K = K..P..P or P..P..K
    # [ ] : Add weighted IC

rje_slimcalc [version 0.4] SLiM Attribute Calculation Module ~ [Top]

Module: rje_slimcalc
Description: SLiM Attribute Calculation Module
Version: 0.4
Last Edit: 03/06/10
Imports: rje, rje_zen, rje_aaprop, rje_disorder, rje_seq, rje_sequence, rje_slim, rje_scoring, gopher_V2, ned_eigenvalues
Imported By: qslimfinder, slimfinder, slimsearch, rje_seqplot, ned_conservationScorer, rje_seq, rje_slimcore, rje_slimlist
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module is based on the old rje_motifstats module. It is primarily for calculating empirical attributes of SLiMs
and their occurrences, such as Conservation, Hydropathy and Disorder.

Commandline:
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Motif Occurrence Attribute Options ###
* slimcalc=LIST : List of additional attributes to calculate for occurrences - Cons,SA,Hyd,Fold,IUP,Chg,Comp []
* winsize=X : Used to define flanking regions for calculations. If negative, will use flanks *only* [0]
* relconwin=X : Window size for relative conservation scoring [30]
* iupath=PATH : The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
* iucut=X : Cut-off for IUPred results (0.0 will report mean IUPred score) [0.0]
* iumethod=X : IUPred method to use (long/short) [short]
* percentile=X : Percentile steps to return in addition to mean [0]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Alignment Settings ###
* usealn=T/F : Whether to search for and use alignments where present. [False]
* alndir=PATH : Path to pre-made alignment files [./]
* alnext=X : File extension of alignment files, AccNum.X (checked before Gopher used) [aln.fas]
* usegopher=T/F : Use GOPHER to generate missing orthologue alignments in alndir - see gopher_V2.py options [False]
* gopherdir=PATH : Path from which to call Gopher (and look for PATH/ALN/AccNum.orthaln.fas) [./]
* fullforce=T/F : Whether to force regeneration of alignments using GOPHER [False]
* orthdb=FILE : File to use as source of orthologues for GOPHER []
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Conservation Parameters ###
* conspec=LIST : List of species codes for conservation analysis. Can be name of file containing list. [None]
* conscore=X : Type of conservation score used: [pos]
- abs = absolute conservation of motif using RegExp over matched region
- pos = positional conservation: each position treated independently
- prob = conservation based on probability from background distribution
- prop = conservation of amino acid properties
- rel = relative local conservation (rlc)
- all = all three methods for comparison purposes
* consamb=T/F : Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
* consinfo=T/F : Weight positions by information content (does nothing for conscore=abs) [True]
* consweight=X : Weight given to global percentage identity for conservation, giving more weight to closer sequences [0]
- 0 gives equal weighting to all. Negative values will upweight distant sequences.
* minhom=X : Minimum number of homologues for making conservation score [1]
* homfilter=T/F : Whether to filter homologues using seqfilter options [False]
* alngap=T/F : Whether to count proteins in alignments that have 100% gaps over motif (True) or (False) ignore
as putative sequence fragments [False] (NB. All X regions are ignored as sequence errors.)
* posmatrix=FILE : Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) [None]
* aaprop=FILE : Amino Acid property matrix file. [aaprop.txt]
* masking=T/F : Whether to use seq.info['MaskSeq'] for Prob cons, if present (else 'Sequence') [True]
* vnematrix=FILE : BLOSUM matrix file to use for VNE relative conservation []
* relgappen=T/F : Whether to invoke the "Gap Penalty" during relative conservation calculations [True]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### SLiM/Occ Filtering Options ###
* slimfilter=LIST : List of stats to filter (remove matching) SLiMs on, consisting of X*Y []
- X is an output stat (the column header),
- * is an operator in the list >, >=, !=, =, <=, <
- Y is a value that X must have, assessed using *.
This filtering is crude and may behave strangely if X is not a numerical stat!
!!! Remember to enclose in "quotes" for <> filtering !!!
* occfilter=LIST : Same as slimfilter but for individual occurrences []
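
The X*Y filter syntax described above can be illustrated with a minimal sketch. This is not the rje_slimcalc implementation: the names `parse_filter` and `filter_matches` are hypothetical, `<=` is included as an assumption, and the fallback to string comparison for non-numeric stats is only one possible reading of "may behave strangely".

```python
import operator
import re

# Hypothetical operator table (assumption: <= is supported alongside the listed operators).
OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    "=": operator.eq,
}

# Two-character operators are listed first so ">=" is not read as ">" followed by "=".
FILTER_RE = re.compile(
    r"^\s*(?P<stat>[^<>=!]+?)\s*(?P<op>>=|<=|!=|>|<|=)\s*(?P<value>.+?)\s*$"
)


def parse_filter(text):
    """Split a filter such as 'IC>=2' into (stat, operator, value)."""
    match = FILTER_RE.match(text)
    if match is None:
        raise ValueError("Cannot parse filter: %r" % text)
    return match.group("stat"), match.group("op"), match.group("value")


def filter_matches(row, text):
    """Return True if the row matches the filter (i.e. it would be removed)."""
    stat, op, value = parse_filter(text)
    observed = row[stat]
    try:
        return OPERATORS[op](float(observed), float(value))
    except (TypeError, ValueError):
        # Assumption: non-numeric stats are compared as plain strings.
        return OPERATORS[op](str(observed), str(value))
```

For example, `filter_matches({"IC": "2.5"}, "IC>=2")` is true under these assumptions, so that SLiM would be filtered out. On the commandline the whole filter would still need quoting so the shell does not treat `<` or `>` as redirection.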

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje
Other modules needed: None

rje_slimcalc Module Version History

    # 0.0 - Initial Compilation based on rje_motifstats methods.
    # 0.1 - Added new probability-based method, inspired by (but different to) Dinkel & Sticht (2007)
    # 0.2 - Mended OccPos finding for wildcards. Added new relative conservation score.
    # 0.3 - Added von Neumann entropy code.
    # 0.4 - Added Webserver pickling of RLC lists.

SLiMCalc Class

    SLiMCalc Class. Author: Rich Edwards (2007).

    Info:str
    - AlnDir = Path to alignment files [./]
    - AlnExt = File extensions of alignments: AccNum.X [aln.fas]
    - ConScore = Type of conservation score used:  [abs]
        - abs = absolute conservation of motif: reports percentage of homologues in which conserved
        - prop = conservation of amino acid properties
        - prob = conservation based on probability from background distribution
        - pos = positional conservation: each position treated independently 
        - all = all three methods for comparison purposes
    - GopherDir = Directory from which GOPHER will be run []
    - PosMatrix = Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix)
    - VNEMatrix = BLOSUM matrix file to use for VNE relative conservation []
    
    Opt:boolean
    - AlnGap = Whether to count proteins in alignments that have 100% gaps over motif (True) or (False) ignore as putative sequence fragments [True]
    - ConsAmb = Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
    - ConsInfo = Weight positions by information content [True]
    - FullForce = Whether to force regeneration of alignments using GOPHER
    - HomFilter = Whether to filter homologues using seqfilter options [False]
    - Masking = Whether to use seq.info['MaskSeq'] for Prob cons, if present (else 'Sequence') [True]
    - RelGapPen = Whether to invoke the "Gap Penalty" during relative conservation calculations [True]
    - UseAln = Whether to look for conservation in alignments
    - UseGopher = Use GOPHER to generate missing orthologue alignments in outdir/Gopher - see gopher.py options [False]
    
    Stat:numeric
    - ConsWeight = Weight given to global percentage identity for conservation, given more weight to closer sequences [0]
    - MinHom = Minimum number of homologues for making conservation score [1]
    - Percentile = Percentile steps to return in addition to mean [25]
    - RelConWin = Determines window size for RelCons calculation [30]
    - WinSize = Used to define flanking regions for stats. If negative, will use flanks *only* [0]

    List:list
    - Alphabet = List of letters in alphabet of interest
    - GopherRun = List of sequence objects that have already been put through GOPHER. (Do not keep retrying!)
    - Headers - List of Headers for combined occurrence stats
    - OccHeaders - List of Headers for individual occurrences
    - Percentile - List of Percentiles to return for combined stats (set up with Headers)
    - SLiMCalc - List of occurrence statistics to calculate []
    - SlimFilter - List of stats to filter (remove matching) SLiMs on, consisting of X*Y  []
                      - X is an output stat (the column header),
                      - * is an operator in the list >, >=, !=, =, <=, <
                      - Y is a value that X must have, assessed using *.
                      This filtering is crude and may behave strangely if X is not a numerical stat!
                      !!! Remember to enclose in "quotes" for <> filtering !!!
    - OccFilter = Same as slimfilter but for individual occurrences []

    Dict:dictionary
    - ConsSpecLists = Dictionary of {BaseName:List} lists of species codes for special conservation analyses
    - ElementIC = Element IC values
    - PosMatrix = Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) {}
    - BLOSUM = BLOSUM matrix loaded from VNE calculation {}

    Obj:RJE_Objects
    - AAPropMatrix = rje_aaprop.AAPropMatrix object
    - Gopher = Gopher Fork object for alignment generation
    - IUPred = Disorder object for running IUPred disorder
    - FoldIndex = Disorder object for running FoldIndex disorder
SLiMCalc._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SLiMCalc._setAttributes(self)
    Sets Attributes of Object.
SLiMCalc.aaFreqEntropy(self,qry,seqs)
    Returns entropy based on whole alignment AAFreqs.
SLiMCalc.absCons(self,Occ,hithom,seqfrag,seqwt)
    Absolute conservation score.
SLiMCalc.bestScore(self,aa,aalist,posmatrix)
    >> aa:str = Amino acid to be compared
    >> aalist:str = List of aas to compare to
    >> posmatrix:dict of {'a1a2':score} to get score from
SLiMCalc.combMotifOccStats(self,occlist,revlist=['Hyd'])
    Combines mean and percentile stats for the Occurrences of a Motif. OccList should all be for same SLiM.
    >> occlist:list of MotifOcc objects to calculate stats for (must all have same Motif)
    >> revlist:list of stats that should be ordered from low(best) to high(worst) rather than the other way round
SLiMCalc.consWeight(self,hitcon,seqwt)
    Weights conservation and returns final conservation score.
    >> hitcon: raw dictionary of {seq:conservation}
    >> seqwt: weighting dictionary of {seq:weighting}
    << cons: conservation score
SLiMCalc.findFudge(self,qryseq,match,pos)
    >> qryseq: degapped query sequence
    >> match:match sequence 
    >> pos:position match is meant to be from 0 to L
    << fudge:int = amount to move pos to find match in qryseq. 0 = not there!
SLiMCalc.findOccPos(self,Occ,qry,fudge=0)
    Finds Motif Occurrence in alignment.
    >> Occ = MotifOcc object
    >> qry = query Sequence object from alignment file
    >> fudge = amount to try shifting match to find occurrence in non-matching sequence
    << (start,end) = start and end position in alignment to allow sequence[start:end]
SLiMCalc.hitAlnCon(self,occlist,progress=True)
    Looks for alignment and, if appropriate, calculates conservation stats.
    
    Any homologues with masked (X) residues that coincide with non-wildcard positions of the motif occurrence will be
    excluded from conservation calculations. Gaps, however, shall be treated as divergence. The exception is that when
    the alngap=F option is used, 100% gapped regions of homologues are also ignored.

    This method deals with all the occurrences of all motifs for a single sequence and its alignment. Global
    alignment statistics are calculated first, then each occurrence for each motif is processed. Since version 1.1,
    subtaxa are treated the same as all taxa to reduce the coding: the default all taxa is now effectively an
    additional subtaxa set.
    
    >> occlist:list of MotifOcc objects to calculate stats for (must all have same Seq)
    >> progress:bool [True] = whether to print progress to screen or not
SLiMCalc.loadBLOSUM(self)
    Loads BLOSUM matrix for VNE calculations.
SLiMCalc.loadOrthAln(self,seq,gopher=True)
    Identifies file, loads and checks alignment. If the identified file is not actually aligned, then RJE_SEQ will try to
    align the proteins using MUSCLE or ClustalW.
    >> seq:Sequence being analysed.
    >> gopher:bool [True] = whether to try and use GOPHER if appropriate (set False if already tried)
    << aln = SeqList object containing alignment with queryseq
SLiMCalc.occStats(self,occlist,xpad=0,progress=False,silent=False)
    Calculates general occurrence stats for occlist - should all have same Seq object. 
    >> occlist:list of MotifOcc objects to calculate stats for (must all have same Seq)
    >> xpad:int [0] = Xs to be added to either side of sequence
    >> progress:bool [False] = whether to print progress to screen or not
    >> silent:bool [False] = whether to make verbosity -1 for duration of occStats 
SLiMCalc.posCons(self,Occ,hithom,red_aln,seqwt,posmatrix)
    Positional conservation score.
SLiMCalc.posMatrix(self)
    Loads and builds PosMatrix for Conservation Scoring.
SLiMCalc.probCons(self,Occ,hithom,red_aln,seqwt,subdict)
    Probability-based conservation score.
SLiMCalc.relConList(self,qry,seqs,seqwt,window=30,store=False)
    Returns list of relative conservation per residue, incorporating sequence weighting, using qry and seqs.
SLiMCalc.relConListFromSeq(self,seq,window=30,store=False)
    Looks for alignment and returns list of relative conservation per residue, incorporating sequence weighting.
    >> seq:Sequence object to analyse
    >> window:int [30] = Size of +/- window for calculation
    >> store:bool [False] = Whether to store results in seq.list['Cons'] and seq.list['RelCons']
SLiMCalc.relCons(self,Occ,relcon)
    Relative conservation score.
    >> Occ:Motif occurrence dictionary.
    >> relcon:list = Relative conservation score for each residue
SLiMCalc.relWinCon(self,conslist,window=30,qry=None)
    Returns relative conservation (x-mean)/sd for +/- windows.
SLiMCalc.seqAlnConList(self,seq)
    Looks for alignment and, if appropriate, calculates conservation stats. Returns a list of the pos/prop
    conservation score for each residue, incorporating sequence weighting.
SLiMCalc.setupFilters(self,slimheaders=[],occheaders=[])
    Sets up SLiM/Occ Filters.
    >> slimheaders:list [] = List of SLiM Attribute headers that can be used as filters.
    >> occheaders:list [] = List of SLiM Occurrence Attribute headers that can be used as filters.
SLiMCalc.setupGopher(self)
    Sets up GOPHER directory etc.
SLiMCalc.setupHeaders(self)
    Sets up Headers and OccHeaders lists based on attribute settings.
SLiMCalc.singleProteinAlignment(self,seq,occlist,alndir='',hitname='AccNum',gopher=True,savefasta=True,wintuple=0)
    Generates copies of protein alignment, with motif hits marked. Return SeqList object.
    >> seq:Sequence object for alignment
    >> occlist:list of MotifOcc objects for sequence
    >> alndir:str = Alignment directory for output
    >> hitname:str = Format of hitnames, used for naming files
    >> gopher:boolean = whether to look to use Gopher if settings correct (set this False if done already)
    >> savefasta:boolean [True] = whether to save fasta file or simply return SeqList object alone.
    >> wintuple:int [0] = Add a "WinTuple" list to aln Object containing [WinStart,WinEnd,VariantAln] (A list!)
SLiMCalc.vneEigen(self,aafreq)
    Converts amino acid frequencies into VNE eigen values.
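
As an illustration of the relative conservation score returned by relWinCon, (x-mean)/sd over a +/- window, the following is a minimal sketch rather than the module's code. The function name `rel_win_con` is hypothetical, and several details are assumptions: the window includes the residue itself, windows are truncated at the sequence ends, the population standard deviation is used, and a window with zero variance scores 0.0.

```python
import math


def rel_win_con(conslist, window=30):
    """Return (x - mean) / sd per position, using residues within +/- window."""
    relcon = []
    for i, x in enumerate(conslist):
        # Assumption: the window is truncated at either end of the sequence.
        win = conslist[max(0, i - window):i + window + 1]
        mean = sum(win) / len(win)
        # Assumption: population (not sample) standard deviation.
        sd = math.sqrt(sum((v - mean) ** 2 for v in win) / len(win))
        # Assumption: a flat window carries no relative signal, so score it as 0.0.
        relcon.append((x - mean) / sd if sd > 0 else 0.0)
    return relcon
```

Positive values mark residues that are more conserved than their local neighbourhood, and negative values mark residues that are less conserved.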

rje_slimcalc Module Methods

rje_slimcalc.runMain()


rje_slimcalc.seqConsList(seq,callobj=None)
    Creates SLiMCalc object and calls seqAlnConList(seq) method.
    >> seq:Sequence object
    >> callobj:Object containing relevant log and cmd_list objects. If None, will use seq

rje_slimcalc Module ToDo Wishlist

    # [ ] : Add and fully integrate the new/custom score options.
    # [ ] : Add new probabilistic Conservation score.
    # [ ] : Expand probabilistic Conservation score to other slimcalc scores.
    # [ ] : Split into rje_seqcons, which should then be inherited by this module but usable by rje_seq alone.

rje_slimcore [version 1.7] Core module/object for SLiMFinder and SLiMSearch ~ [Top]

Module: rje_slimcore
Description: Core module/object for SLiMFinder and SLiMSearch
Version: 1.7
Last Edit: 03/06/10
Imports: rje, rje_blast, rje_motiflist, rje_seq, rje_sequence, rje_scoring, rje_slim, rje_slimcalc, rje_slimlist, rje_motif_V3, rje_dismatrix_V2, comparimotif_V3, unifake
Imported By: aphid, pingu, qslimfinder, slimfinder, slimsearch
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module is primarily to contain core dataset processing methods for both SLiMFinder and SLiMSearch to inherit and
use. This primarily consists of the options and methods for masking datasets and generating UPC. This module can
therefore be run in standalone mode to generate UPC files for SLiMFinder or SLiMSearch.

In addition, the secondary MotifSeq and Randomise functions are handled here.

Secondary Functions:
The "MotifSeq" option will output fasta files for a list of X:Y, where X is a motif pattern and Y is the output file.
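
A minimal sketch of how an X:Y list might be handled is given here. The helper names `parse_motifseq` and `matching_sequences` are hypothetical, and it is an assumption that the sequences written to each file are those containing a match to the pattern.

```python
import re


def parse_motifseq(items):
    """Convert ['pattern:outfile', ...] into {pattern: outfile}."""
    motifseq = {}
    for item in items:
        # Split on the first colon only, so the output file may itself contain one.
        pattern, sep, outfile = item.partition(":")
        if not sep or not pattern or not outfile:
            raise ValueError("Expected X:Y (pattern:file), got %r" % item)
        motifseq[pattern] = outfile
    return motifseq


def matching_sequences(pattern, sequences):
    """Return {name: sequence} for sequences containing the pattern (assumed rule)."""
    regex = re.compile(pattern)
    return {name: seq for name, seq in sequences.items() if regex.search(seq)}
```

With this sketch, `parse_motifseq(["K..P..P:kpp.fas"])` gives a dictionary of the same {pattern:output file} shape as the MotifSeq dictionary listed for the SLiMCore class.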

The "Randomise" function will take a set of input datasets (as in Batch Mode) and regenerate a set of new datasets
by shuffling the UPC among datasets. Note that, at this stage, this is quite crude and may result in the final
datasets having fewer UPC due to common sequences and/or relationships between UPC clusters in different datasets.
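
The shuffling idea can be sketched as follows. This is an illustration under stated assumptions, not the rje_slimcore method: `randomise_upc` is a hypothetical name, each UPC is represented as a tuple of sequence IDs, each new dataset receives as many UPC as the dataset it replaces, and the output naming is invented for the example.

```python
import random


def randomise_upc(datasets, randbase="rand", seed=None):
    """Pool the UPC from all datasets, shuffle them, and deal them back out.

    datasets: {name: [upc, ...]} where each upc is a tuple of sequence IDs.
    Returns {new_name: [upc, ...]} with the same dataset sizes as the input.
    """
    rng = random.Random(seed)
    pool = [upc for upcs in datasets.values() for upc in upcs]
    rng.shuffle(pool)
    randomised = {}
    offset = 0
    for number, upcs in enumerate(datasets.values(), start=1):
        # Hypothetical naming scheme: base name plus a running number.
        randomised["%s_%d" % (randbase, number)] = pool[offset:offset + len(upcs)]
        offset += len(upcs)
    return randomised
```

The sketch treats UPC as independent units. In real data, UPC drawn from different datasets can share sequences or be related to each other, which is why regenerated datasets may end up with fewer UPC than the originals.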

Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
* seqin=FILE : Sequence file to search [None]
* batch=LIST : List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
* maxseq=X : Maximum number of sequences to process [500]
* maxupc=X : Maximum UPC size of dataset to process [0]
* sizesort=X : Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
* walltime=X : Time in hours before program will abort search and exit [1.0]
* resdir=PATH : Redirect individual output files to specified directory (and look for intermediates) [SLiMFinder/]
* buildpath=PATH : Alternative path to look for existing intermediate files [SLiMFinder/]
* force=T/F : Force re-running of BLAST, UPC generation and SLiMBuild [False]
* dna=T/F : Whether the sequence files are DNA rather than protein [False]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Evolutionary Filtering Options ###
* efilter=T/F : Whether to use evolutionary filter [True]
* blastf=T/F : Use BLAST Complexity filter when determining relationships [True]
* blaste=X : BLAST e-value threshold for determining relationships [1e-4]
* altdis=FILE : Alternative all by all distance matrix for relationships [None]
* gablamdis=FILE : Alternative GABLAM results file [None] (!!!Experimental feature!!!)
* domtable=FILE : Domain table containing domain ("Type") and sequence ("Name") pairings for additional UPC [None]
* homcut=X : Max number of homologues to allow (to reduce large multi-domain families) [0]
* extras=T/F : Whether to generate additional output files (distance matrices etc.) [True]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Input Masking and AA Frequency Options ###
* masking=T/F : Master control switch to turn off all masking if False [True]
* dismask=T/F : Whether to mask ordered regions (see rje_disorder for options) [False]
* consmask=T/F : Whether to use relative conservation masking [False]
* ftmask=LIST : UniProt features to mask out [EM,DOMAIN,TRANSMEM]
* imask=LIST : UniProt features to inversely ("inclusively") mask. (Seqs MUST have 1+ features) []
* fakemask=T/F : Whether to invoke UniFake to generate additional features for masking [False]
* compmask=X,Y : Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
* casemask=X : Mask Upper or Lower case [None]
* metmask=T/F : Masks the N-terminal M (can be useful if SLiMFinder termini=T) *Also maskm=T/F* [True]
* posmask=LIST : Masks list of position-specific aas, where list = pos1:aas,pos2:aas *Also maskpos=LIST* [2:A]
* aamask=LIST : Masks list of AAs from all sequences (reduces alphabet) []
* motifmask=X : List (or file) of motifs to mask from input sequences []
* logmask=T/F : Whether to output the log messages for masking of individual sequences to screen [False]
* masktext=X : Text ID to over-ride automated masking text and identify specific masking settings [None]
* maskpickle=T/F : Whether to save/load pickle of masked input data, independent of main pickling [False]
* maskfreq=T/F : Whether to use masked AA Frequencies (True), or (False) mask after frequency calculations [True]
* aafreq=FILE : Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
* smearfreq=T/F : Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
* qregion=X,Y : Mask all but the region of the query from (and including) residue X to residue Y [0,-1]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Advanced Output Options ###
* targz=T/F : Whether to tar and zip dataset result files (UNIX only) [False]
* pickle=T/F : Whether to save/use pickles [True]
* savespace=0 : Delete "unnecessary" files following run (best used with targz): [0]
- 0 = Delete no files
- 1 = Delete all bar *.upc, *.pickle and *.occ.csv files (Pickle excluded from tar.gz with this setting)
- 2 = Delete all bar *.upc, *.pickle files (Pickle excluded from tar.gz with this setting)
- 3 = Delete all dataset-specific files including *.upc and *.pickle (not *.tar.gz)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Additional Functions I: MotifSeq ###
* motifseq=LIST : Outputs fasta files for a list of X:Y, where X is the pattern and Y is the output file []
* slimbuild=T/F : Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [True]

### Additional Functions II: Randomised datasets ###
* randomise=T/F : Randomise UPC within batch files and output new datasets [False]
* randir=PATH : Output path for creation of randomised datasets [Random/]
* randbase=X : Base for random dataset name [rand]
* randsource=FILE : Source for new sequences for random datasets (replaces UPCs) [None]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
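
To make the compmask=X,Y rule concrete (same AA in X+ of Y consecutive aas), here is a minimal sketch. The function name `comp_mask` is hypothetical, and it is an assumption that only the over-represented residue within each qualifying window is replaced by X, rather than the whole window.

```python
def comp_mask(sequence, x=5, y=8, mask="X"):
    """Mask residues that occur x or more times within any window of y residues."""
    masked = list(sequence)
    # Counts are taken from the original sequence, so earlier masking never
    # changes whether a later window qualifies.
    for start in range(len(sequence) - y + 1):
        window = sequence[start:start + y]
        for aa in set(window):
            if aa != mask and window.count(aa) >= x:
                for offset, residue in enumerate(window):
                    if residue == aa:
                        masked[start + offset] = mask
    return "".join(masked)
```

With the default 5,8 setting, a run such as AAAAA at the start of a sequence is masked, while a sequence with no residue occurring five times in any eight-residue window is returned unchanged.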

Uses general modules: copy, glob, math, os, string, sys, time
Uses RJE modules: rje, rje_blast, rje_dismatrix_V2, rje_seq, rje_scoring, rje_slim, rje_slimlist
Other modules needed: None

rje_slimcore Module Version History

    # 0.0 - Initial Compilation based on SLiMFinder 3.1.
    # 0.1 - Tidied with respect to SLiMFinder and SLiMSearch.
    # 0.2 - Added DNA mode.
    # 0.3 - Added relative conservation masking.
    # 0.4 - Altered TarGZ to *not* include *.pickle.gz
    # 1.0 - Standardised masking options. Added motifmask and motifcull.
    # 1.1 - Checked/updated Randomise option.
    # 1.2 - Added generation of UniFake file to maskInput() method. (fakemask=T)
    # 1.3 - Added aamask effort
    # 1.4 - Added DomTable option
    # 1.5 - Added masktext and maskpickle options to accelerate runs with large masked datasets.
    # 1.6 - Fixed occurrence table bugs.
    # 1.7 - Added SizeSort and NewUPC. Add server #END statements.

SLiMCore Class

    SLiMCore Class. Author: Rich Edwards (2007).

    Info:str
    - AAFreq = Use FILE to replace individual sequence AAFreqs (FILE can be sequences or aafreq) [None]
    - AltDis = Alternative all by all distance matrix for relationships [None]
    - Build = String giving summary of key SLiMBuild options
    - BuildPath = Alternative path to look for existing intermediate files [SLiMFinder/]
    - CaseMask = Mask Upper or Lower case [None]
    - CompMask = Mask low complexity regions (same AA in X+ of Y consecutive aas) [5,8]
    - DomTable = Domain table containing domain ("Type") and sequence ("Name") pairings for additional UPC [None]
    - GablamDis = Alternative GABLAM results file [None]
    - Input = Original name (and path) of input file
    - MaskText = Text ID to over-ride automated masking text and identify specific masking settings [None]
    - MotifMask = List (or file) of motifs to mask from input sequences []
    - RanDir = Output path for creation of randomised datasets [./]
    - Randbase = Base for random dataset name [rand_]
    - RandSource = Source for new sequences for random datasets (replaces UPCs) [None]
    - ResDir = Redirect individual output files to specified directory [SLiMFinder/]
    
    Opt:boolean
    - ConsMask = Whether to use relative conservation masking [False]
    - DisMask = Whether to mask ordered regions (see rje_disorder for options) [False]
    - DNA = Whether the sequence files are DNA rather than protein [False]
    - Force = whether to force recreation of key files [False]
    - EFilter = Whether to use evolutionary filter [True]
    - Extras = Whether to generate additional output files (alignments etc.) [False]
    - FakeMask = Whether to invoke UniFake to generate additional features for masking [False]
    - LogMask = Whether to log the masking of individual sequences [True]
    - Masked = Whether dataset has been masked [False]
    - Masking = Master control switch to turn off all masking if False [True]
    - MaskM = Masks the N-terminal M (can be useful if termini=T) [False]
    - MaskPickle = Whether to save/load pickle of masked input data, independent of main pickling [False]
    - OccStatsCalculated = Whether OccStats have been calculated for all occurrences [False]
    - MaskFreq = Whether to mask input before any analysis, or after frequency calculations [True]
    - Pickle = Whether to save/use pickles [True]
    - Randomise = Randomise UPC within batch files [False]
    - SlimBuild = Whether to build motifs with SLiMBuild. (For combination with motifseq only.) [True]
    - SmearFreq = Whether to "smear" AA frequencies across UPC rather than keep separate AAFreqs [False]
    - TarGZ = Whether to tar and zip dataset result files (UNIX only) [False]
    - Test = Special Test parameter for experimentation with code [False]
    - Webserver = Generate additional webserver-specific output [False]

    Stat:numeric
    - HomCut = Max number of homologues to allow (to reduce large multi-domain families) [0]
    - MaxSeq = Maximum number of sequences to process [500]
    - MaxUPC = Maximum UPC size of dataset to process [0]
    - MST = MST corrected size for whole dataset
    - SaveSpace = Delete "unnecessary" files following run (see Manual for details) [0]
    - SizeSort = Sorts batch files by size prior to running (+1 small->big; -1 big->small; 0 none) [0]
    - StartTime = Starting time in seconds (for output when using shared log file)
    - WallTime = Time in hours before program will terminate [1.0]

    List:list
    - AAMask = Masks list of AAs from all sequences (reduces alphabet) []
    - Alphabet = List of characters to include in search (e.g. AAs or NTs) 
    - Batch = List of files to search, wildcards allowed. (Over-ruled by seqin=FILE.) [*.dat,*.fas]
    - FTMask = UniProt features to mask out [EM,DOMAIN,TRANSMEM]
    - Headers = Headers for main SLiMFinder output table
    - IMask = UniProt features to inversely ("inclusively") mask [IM]
    - QRegion = Mask all but the region of the query from (and including) residue X to residue Y [0,-1]
    - SigSlim = List of significant SLiMs - matches keys to self.dict['Slim(Freq)'] - *in rank order*
    - UP = List of UP cluster tuples
    - Warning = List of text (log) warnings to reproduce at end of run
     
    Dict:dictionary
    - AAFreq = AA frequency dictionary for each seq / UPC
    - DimFreq = Frequency of dimers of each X length per upc {upc:[freqs]}
    - Dimers = main nested dictionary for SLiMFinder {Ai:{X:{Aj:{'UP':[UPC],'Occ':[(Seq,Pos)]}}}}
    - ElementIC = dictionary of {Motif Element:Information Content}
    - Extremf. = Dictionary of {length:extremferroni correction}
    - MaskPos = Masks list of position-specific aas, where list = pos1:aas,pos2:aas  [2:A]
    - MotifSeq = Dictionary of {pattern:output file for sequences}
    - MST = MST corrected size for UPC {UPC:MST}
    - Slim = main dictionary containing SLiMs with enough support {Slim:{'UPC':[UPC],'Occ':[Seq,Pos]}}
    - SeqOcc = dictionary of {Slim:{Seq:Count}}
    - SmearFreq = Smeared AA frequency dictionary for IC calculations

    Obj:RJE_Objects
    - SeqList = main SeqList object containing dataset to be searched
    - SlimList = MotifList object handling motif stats and filtering options
SLiMCore.OLDaddSLiMToList(self,slim)
    Add slims from self.dict['Slim'] to self.obj['SlimList'].
SLiMCore.UPNum(self)


SLiMCore._cmdList(self)


SLiMCore._setAttributes(self)
    Sets Attributes of Object.
SLiMCore.aaNum(self)


SLiMCore.abNprob(self,a,b,N,overlap=0,dirn='less')
    Returns probabilities of no overlap in a given b/N and in b given a/N.
SLiMCore.addSLiMToList(self,slim)
    Add slims from self.dict['Slim'] to self.obj['SlimList'].
SLiMCore.batchFiles(self,pickup=[])
    Returns batch file list.
SLiMCore.calculateSLiMOccStats(self)
    Makes entries to SLiMList object and calculates attributes with slimcalc.
SLiMCore.coreCmd(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SLiMCore.coreDefaults(self)
    ### Sets the default parameters used by SLiMCore functions (for easy inheritance).
SLiMCore.dataset(self)


SLiMCore.extras(self,level=1)
    Returns whether extras settings permit output at this level.
SLiMCore.getUP(self,seq)


SLiMCore.loadUPCFromFile(self,ufile)
    Loads UPC details from specific file.
SLiMCore.makeAAFreq(self)
    Makes an initial AAFreq dictionary containing AA counts only (including Xs).
SLiMCore.makeMST(self,gablam=None)
    Makes UPC dictionary from GABLAM.
SLiMCore.makeUPC(self)
    Generates UP Clusters from self.obj['SeqList'] using BLAST.
SLiMCore.maskInput(self)
    Masks input sequences, replacing masked regions with Xs. Creates seq.info['PreMask' & 'MaskSeq']
SLiMCore.maskPickle(self,load=False)
    Loads or Saves post-masking pickle.
SLiMCore.maskPickleMe(self,basefile=None,gzip=True,replace=True)
    Saves self object to pickle and zips.
    >> basefile:str [None] = if none, will use self.info['Basefile']
    >> gzip:bool [True] = whether to GZIP (win32=F only)
    >> replace:bool [True] = whether to replace existing Pickle
SLiMCore.maskText(self,joiner='',freq=True)
    Returns masking text.
SLiMCore.meanDisMST(self,gablam=None)
    Makes UPC dictionary from GABLAM using mean distance instead of MST.
SLiMCore.motifSeq(self)
    Outputs sequence files for given motifs.
SLiMCore.myPickle(self)
    Returns pickle identifier, also used for Outputs "Build" column.
SLiMCore.newBatchRun(self,infile)
    Returns SLiMCore object for new batch run.
SLiMCore.nonX(self)


SLiMCore.occFilter(self)


SLiMCore.occStats(self)


SLiMCore.randomise(self)
    Makes random datasets using batch file UPCs.
SLiMCore.readUPC(self)
    Looks for UPC file and loads details from file.
SLiMCore.run(self,batch=False)
    Main Run Method. SLiMFinder, SLiMSearch etc. should have their own methods.
    1. Check for randomise function and execute if appropriate
    2. Input:
        - Read sequences into SeqList
        - or - Identify appropriate Batch datasets and rerun each with batch=True
    3. Masking and UPC construction.
    4. MotifSeq option if desired.
    >> batch:bool [False] = whether this run is already an individual batch mode run.
SLiMCore.seqBaseFile(self)
    Returns the results directory basefile as specified by sequence alone (e.g. for SLiMSearch)
SLiMCore.seqNum(self)


SLiMCore.seqs(self)


SLiMCore.serverEnd(self,endcause,details=None,exit=True)
    Adds #END statement for webserver.
    >> endcause:str = Identifier for ending program run.
    >> details:str [None] = Extra information for certain end causes only.
    >> exit:bool [True] = whether to terminate program after #END statement.
SLiMCore.setAlphabet(self)
    Sets up self.list['Alphabet'].
SLiMCore.setQRegion(self)
    Sets up Query Region masking.
SLiMCore.setupBasefile(self)
    Sets up self.info['Basefile'].
SLiMCore.slimIC(self,slim,usefreq=False)
    Returns IC of slim. Does not account for variable length wildcards!
SLiMCore.slimNum(self)


SLiMCore.slimOccNum(self,slim,upc=None)
    Returns number of occ of Slim in given UPC.
SLiMCore.slimSeqNum(self,slim)
    Returns the number of sequences SLiM occurs in.
SLiMCore.slimUP(self,slim)


SLiMCore.smearAAFreq(self,update=True)
    Equalises AAFreq across UPC. Leaves Totals unchanged. Updates or returns smearfreq.
SLiMCore.statFilter(self)


SLiMCore.tarZipSaveSpace(self)
    Tars and Zips output and/or deletes extra files as appropriate.
SLiMCore.tidyMotifObjects(self)
    This should not be necessary but somehow is!
SLiMCore.units(self)


SLiMCore.wallTime(self)
    Exits if walltime has been reached.
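
The walltime check combines two attributes listed above: StartTime (in seconds) and WallTime (in hours). A minimal sketch of that comparison follows. The function `walltime_reached` is hypothetical, and treating a walltime of zero or less as "no limit" is an assumption rather than documented behaviour.

```python
import time


def walltime_reached(start_time, walltime_hours, now=None):
    """Return True once walltime_hours have elapsed since start_time (in seconds)."""
    if walltime_hours <= 0:
        # Assumption: zero or negative walltime disables the check.
        return False
    if now is None:
        now = time.time()
    return (now - start_time) >= walltime_hours * 3600.0
```

A caller would poll this between search stages and, when it returns True, log the reason and exit.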

rje_slimcore Module Methods

rje_slimcore.patternFromCode(slim)
    Returns pattern with wildcard for iXj formatted SLiM (e.g. A-3-T-0-G becomes A...TG).
rje_slimcore.runMain()


rje_slimcore.slimDif(slim1,slim2)
    Returns number of different positions between slim1 and slim2.
rje_slimcore.slimLen(slim)


rje_slimcore.slimPos(slim)




rje_slimcore Module ToDo Wishlist

    # [ ] : Reduce methods and attributes of SLiMCore object to only those necessary for function.
    # [ ] : Reduce commandline options accordingly - proper use of coreCmd() and add coreDefaults().
    # [ ] : Add proper use of seqbase and basefile for SLiMSearch results output.

rje_slimlist [version 0.5] SLiM dataset manager ~ [Top]

Module: rje_slimlist
Description: SLiM dataset manager
Version: 0.5
Last Edit: 16/11/10
Imports: rje, rje_seq, rje_slim, rje_slimcalc, rje_zen, rje_motif_V3
Imported By: comparimotif_V3, qslimfinder, slimfinder, slimsearch, bob, rje_iridis, rje_slimcore
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module is a replacement for the rje_motiflist module and contains the SLiMList class, a replacement for the
MotifList class. The primary function of this class is to load and store a list of SLiMs and control generic SLiM
outputs for such programs as SLiMSearch. This class also controls motif filtering according to features of the motifs
and/or their occurrences.

Commandline: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Basic Input/Output Options ###
* motifs=FILE : File of input motifs/peptides [None]
Single line per motif format = 'Name Sequence #Comments' (Comments are optional and ignored)
Alternative formats include fasta, SLiMDisc output and raw motif lists.
* reverse=T/F : Reverse the motifs - good for generating a test comparison data set [False]
* wildscram=T/F : Perform a wildcard spacer scrambling - good for generating a test comparison data set [False]
* motifout=FILE : Name of output file for reformatted/filtered SLiMs (PRESTO format) [None]
* ftout=T/F : Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [*.features.tdt]
* mismatch=LIST : List of X:Y pairs for mismatch dictionary, where X mismatches allowed for Y+ defined positions []
* motinfo=FILE : Filename for output of motif summary table (if desired) [None]
* dna=T/F : Whether motifs should be considered as DNA motifs [False]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
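
The single-line motif format described for motifs=FILE ('Name Sequence #Comments') could be read along the following lines. This is a minimal sketch, not the loadMotifs code: the helper name parse_motif_line is hypothetical, and it assumes whitespace-separated fields and that '#' never appears inside a motif pattern. The other input formats (fasta, SLiMDisc output, raw motif lists) are not handled.

```python
def parse_motif_line(line):
    """Parse one 'Name Sequence #Comments' line into (name, sequence).

    Returns None for blank lines, comment-only lines and lines with no
    sequence field. Hypothetical helper for illustration only.
    """
    text = line.split('#', 1)[0].strip()   # comments are optional and ignored
    if not text:
        return None
    fields = text.split()
    if len(fields) < 2:
        return None   # a raw motif list would need separate handling
    return fields[0], fields[1]


print(parse_motif_line('motif1  P.L.P  # example comment'))
```
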
### Advanced Input I: Motif Filtering Options ###
* minpos=X : Min number of defined positions [0]
* minfix=X : Min number of fixed positions for a motif to contain [0]
* minic=X : Min information content for a motif (1 fixed position = 1.0) [0.0]
* goodmotif=LIST : List of text to match in Motif names to keep (can have wildcards) []
* nrmotif=T/F : Whether to remove redundancy in input motifs [False]
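
A name filter in the spirit of goodmotif=LIST might look like the sketch below. It assumes shell-style wildcards (as matched by Python's fnmatch module) and that an empty list keeps every motif; neither assumption is confirmed by the documentation above, and keep_motif is a hypothetical name.

```python
from fnmatch import fnmatchcase


def keep_motif(name, goodmotif):
    """Return True if name matches any goodmotif entry, or if no filter is set.

    Assumes glob-style wildcards ('*', '?'), which is an illustrative choice.
    """
    if not goodmotif:
        return True
    return any(fnmatchcase(name, wanted) for wanted in goodmotif)


names = ['sh3_1', 'sh3_2', 'pdz_1']
print([name for name in names if keep_motif(name, ['sh3_*'])])
```
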

### Advanced Input II: Motif reformatting options ###
* trimx=T/F : Trims Xs from the ends of a motif [False]
* minimotif=T/F : Input file is in minimotif format and will be reformatted (PRESTO File format only) [False]
* ambcut=X : Cut-off for max number of choices in ambiguous position to be shown as variant [10]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
### Advanced Output I: Motif Occurrence Statistics Options ###
* slimcalc=LIST : List of additional statistics to calculate for occurrences - Cons,SA,Hyd,Fold,IUP,Chg []
* winsize=X : Used to define flanking regions for stats. If negative, will use flanks *only* [0]
* iupath=PATH : The full path to the IUPred executable [c:/bioware/iupred/iupred.exe]
* iucut=X : Cut-off for IUPred results (0.0 will report mean IUPred score) [0.0]
* iumethod=X : IUPred method to use (long/short) [short]
* percentile=X : Percentile steps to return in addition to mean [0]
* peptides=T/F : Whether to output peptide sequences based on motif and winsize [False]
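
One possible reading of the winsize option is sketched below: a non-negative value extends the occurrence by that many residues on each side, while a negative value returns the two flanks without the occurrence itself. The function name, the 0-based end-exclusive coordinates and the decision to join the two flanks into one string are all assumptions made for illustration.

```python
def stat_window(seq, start, end, winsize=0):
    """Return the region of seq that statistics would be calculated over.

    start/end are 0-based, end-exclusive occurrence coordinates (assumed).
    winsize >= 0: occurrence plus winsize flanking residues on each side.
    winsize < 0:  the flanks of abs(winsize) residues only.
    """
    flank = abs(winsize)
    left = seq[max(0, start - flank):start]
    right = seq[end:end + flank]
    if winsize < 0:
        return left + right
    return left + seq[start:end] + right


print(stat_window('ABCDEFGHIJ', 3, 6, winsize=2))
print(stat_window('ABCDEFGHIJ', 3, 6, winsize=-2))
```
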

### Advanced Output II: Alignment Settings ###
* usealn=T/F : Whether to search for and use alignments where present. [False]
* gopher=T/F : Use GOPHER to generate missing orthologue alignments in alndir - see gopher.py options [False]
* alndir=PATH : Path to alignments of proteins containing motifs [./] * Use forward slashes (/)
* alnext=X : File extension of alignment files, accnum.X [aln.fas]
* protalndir=PATH : Output path for Protein Alignments [ProteinAln/]
* motalndir=PATH : Output path for Motif Alignments []
* flanksize=X : Size of sequence flanks for motifs [30]
* xdivide=X : Size of dividing Xs between motifs [10]
* fullforce=T/F : Whether to force regeneration of alignments using GOPHER

### Advanced Output III: Conservation Parameters ###
* conspec=LIST : List of species codes for conservation analysis. Can be name of file containing list. [None]
* conscore=X : Type of conservation score used: [pos]
- abs = absolute conservation of motif using RegExp over matched region
- pos = positional conservation: each position treated independently
- prop = conservation of amino acid properties
- all = all three methods for comparison purposes
* consamb=T/F : Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
* consinfo=T/F : Weight positions by information content (does nothing for conscore=abs) [True]
* consweight=X : Weight given to global percentage identity for conservation, giving more weight to closer sequences [0]
- 0 gives equal weighting to all. Negative values will upweight distant sequences.
* alngap=T/F : Whether to count proteins in alignments that have 100% gaps over motif (True) or (False) ignore
as putative sequence fragments [False] (NB. All X regions are ignored as sequence errors.)
* posmatrix=FILE : Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) [None]
* aaprop=FILE : Amino Acid property matrix file. [aaprop.txt]
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje
Other modules needed: None

rje_slimlist Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Added SLiMCalc object to replace rje_motif_stats.
    # 0.2 - Added DNA option.
    # 0.3 - Added wildcard spacer scrambling - good for generating a test comparison data set.
    # 0.4 - Added motif-splitting based on ("|") options.
    # 0.5 - Fixed minor typo bug.

SLiMList Class

    SLiMList Class. Author: Rich Edwards (2007).

    Info:str
    - AlnDir = Path to alignment files [./]
    - AlnExt = File extensions of alignments: AccNum.X [aln.fas]
    - ConScore = Type of conservation score used:  [abs]
        - abs = absolute conservation of motif: reports percentage of homologues in which conserved
        - prop = conservation of amino acid properties
        - pos = positional conservation: each position treated independently 
        - all = all three methods for comparison purposes
    - MotAlnDir = Directory name for output of motif alignments []
    - Motifs = Name of input motif file [None]
    - MotInfo = Filename for output of motif summary table (if desired) [None]
    - MotifOut = Filename for output of reformatted (and filtered?) motifs in PRESTO format [None]
    - PosMatrix = Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix)
    - ProtAlnDir = Directory name for output of protein alignments [ProteinAln/]
    
    Opt:boolean
    - AlnGap = Whether to count proteins in alignments that have 100% gaps over motif (True) or (False) ignore as putative sequence fragments [True]
    - Compare = whether being called by CompariMotif (has some special requirements!)
    - ConsAmb = Whether to calculate conservation allowing for degeneracy of motif (True) or of fixed variant (False) [True]
    - ConsInfo = Weight positions by information content [True]
    - DNA = Whether motifs should be considered as DNA motifs [False]
    - FTOut = Make a file of UniProt features for extracted parent proteins, where possible, incorporating SLiMs [*.features.tdt]
    - FullForce = Whether to force regeneration of alignments using GOPHER
    - Gopher = Use GOPHER to generate missing orthologue alignments in outdir/Gopher - see gopher.py options [False]
    - MiniMotif = Input file is in minimotif format and will be reformatted [False]
    - NRMotif = Whether to remove redundancy in input motifs [False]
    - Peptides = Whether to output peptide sequences based on motif [False]
    - Reverse = Reverse the motifs - good for generating a test comparison data set [False]
    - TrimX = Trims Xs from the ends of a motif
    - UseAln = Whether to look for conservation in alignments
    - WildScram = Perform a wildcard spacer scrambling - good for generating a test comparison data set [False]
    
    Stat:numeric
    - AmbCut = Cut-off for max number of choices in ambiguous position to be shown as variant [10]
        For mismatches, this is the max number of choices for an ambiguity to be replaced with a mismatch wildcard
    - ConsWeight = Weight given to global percentage identity for conservation, giving more weight to closer sequences [0]
    - FlankSize = Size of sequence flanks for motifs in MotifAln [30]
    - MinPos = Minimum number of defined positions in motif [0]
    - MinFix = Min number of fixed positions for a motif to contain [0]
    - MinIC = Min information content for a motif (1 fixed position = 1.0) [0.0]
    - Percentile = Percentile steps to return in addition to mean [25]
    - WinSize = Used to define flanking regions for stats. If negative, will use flanks *only* [0]
    - XDivide = Size of dividing Xs between motifs [10]

    List:list
    - Alphabet = List of letters in alphabet of interest
    - GoodMotif = List of text to match in Motif names to keep (can have wildcards) []
    - Motif = List of rje_slim.SLiM objects
    - OccStats - List of occurrence statistics to calculate []

    Dict:dictionary
    - ConsSpecLists = Dictionary of {BaseName:List} lists of species codes for special conservation analyses
    - ElementIC = Dictionary of {Position Element:IC}
    - MisMatch = Dictionary of {X mismatches:Y+ defined positions}
    - PosMatrix = Score matrix for amino acid combinations used in pos weighting. (conscore=pos builds from propmatrix) {}

    Obj:RJE_Objects
    - AAPropMatrix = rje_aaprop.AAPropMatrix object
    - SeqList = SeqList object of search database
    - SLiMCalc = SLiMCalc object for SLiM/Occurrence attribute calculations
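
The MisMatch dictionary listed above ({X mismatches:Y+ defined positions}, populated from the mismatch=LIST X:Y pairs) can be pictured with the following sketch. Both function names are hypothetical, and the lookup rule (take the largest X whose Y threshold is met) is an assumed interpretation of "X mismatches allowed for Y+ defined positions".

```python
def parse_mismatch(pairs):
    """Convert ['1:4', '2:8'] into {1: 4, 2: 8}, i.e. {X mismatches: Y+ positions}."""
    mismatch = {}
    for pair in pairs:
        x, y = pair.split(':')
        mismatch[int(x)] = int(y)
    return mismatch


def allowed_mismatches(mismatch, defined_positions):
    """Return the largest X whose Y threshold is met by defined_positions (else 0)."""
    allowed = [x for x, y in mismatch.items() if defined_positions >= y]
    return max(allowed) if allowed else 0


rules = parse_mismatch(['1:4', '2:8'])
print(allowed_mismatches(rules, 5))
```
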
SLiMList.FTOut(self,acc_occ)
    Produces a single file for the dataset of extracted UniProt features with the motif positions added.
    >> acc_occ:dictionary of {AccNum:{Motif:[(position:match)]}}
SLiMList.OLDstatFilter(self,datadict)
    Returns True if motif should be kept according to self.obj['Presto'].list['StatFilter']. 
    >> datadict:dictionary of hit stats
    << True/False if accepted/filtered
SLiMList._addMotif(self,name,seq,reverse=False,check=False,logrem=True,wildscram=False)
    Adds new motif to self.list['Motif']. Checks redundancy etc.
    >> name:str = Motif Name
    >> seq:str = Motif Sequence read from file
    >> reverse:boolean [False] = whether to reverse sequence
    >> check:boolean [False] = whether to check redundancy and sequence length
    >> logrem:boolean [True] = whether to log removal of motifs
    >> wildscram:boolean [False] = whether to perform wildcard scrambling on motif
    << returns Motif object or None if failed
SLiMList._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SLiMList._setAttributes(self)
    Sets Attributes of Object.
SLiMList._splitMotif(self,name,seq,desc)
    Splits complex motifs on "|" and adds each separately.
SLiMList.addOcc(self,Seq=None,Motif=None,data={},merge=False)
    Adds a MotifOcc object with the data given.
    >> Seq:Sequence object in which the occurrence lies
    >> Motif:Motif object
    >> data:dictionary of data to add {'Info':{},'Stat':{},'Data':{}}
    >> merge:bool [False] = whether to check for existing occurrence and merge if found
SLiMList.calculateOccAttributes(self,silent=False,wallobj=None)
    Executes rje_slimcalc calculations via rje_slimlist object.
SLiMList.checkForOcc(self,Occ,merge=True)
    Returns existing MotifOcc if one is found, else given Occ.
    >> Occ:MotifOcc object to check
    >> merge:bool [True] = whether to merge Occ with found occurrence, if there is one
SLiMList.combMotifOccStats(self,revlist=['Hyd'],progress=True,motiflist=[])
    Combines mean and percentile stats for the Occurrences of each Motif. Combines all attributes stored in
    self.obj['SLiMCalc'].list['Headers']
    >> statlist:list of stats to combine from occurrences
    >> revlist:list of stats that should be ordered from low(best) to high(worst) rather than the other way round
    >> progress:boolean [True] = whether to log progress of stat combination
    >> motiflist:list of motifs to combine (will use all if none given)
SLiMList.loadMotifs(self,file=None,clear=True)
    Loads motifs and populates self.list['Motif'].
    >> file:str = filename or self.info['Motif'] if None
    >> clear:boolean = whether to clear self.list['Motif'] before loading [True]
SLiMList.mapMotif(self,Occ,update=True)
    Returns Motif Object and updates Occ, based on self.slims() and Occ data.
    >> Occ:MotifOcc object to check
    >> update:bool [True] = whether to update own list['Motifs'] and/or Motif objects
SLiMList.mapPattern(self,pattern,update=True)
    Returns Motif Object based on self.slims(). Will add if update=True and pattern missing.
    >> pattern:str = motif pattern to check
    >> update:bool [True] = whether to update own list['Motifs'] and/or Motif objects
SLiMList.mergeOcc(self,occlist,overwrite=False)
    Adds Info, Stat and Data from occlist[1:] to occlist[0]. If overwrite=False, then the occurrences should be in
    order of preferred data, as only missing values will be added. If overwrite=True, then later MotifOcc in the list
    will overwrite the values of the earlier ones. (Note that it is always the first MotifOcc object that is changed.)
    Note also that no checks are made that the objects *should* be merged. (See self.checkForOcc())
    >> occlist:list of MotifOcc to merge. Will add to occlist[0] from occlist[1:]
    >> overwrite:bool [False] = If False, will only add missing Info/Stat/Data entries. If True, will add all.
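
The merge rule described for mergeOcc can be sketched with plain dictionaries standing in for MotifOcc objects. This is an illustration under stated assumptions, not the SLiMList code: merge_occ is a hypothetical name, each occurrence is modelled as {'Info':{},'Stat':{},'Data':{}}, and a "missing" value is taken to mean an absent key.

```python
def merge_occ(occlist, overwrite=False):
    """Merge Info/Stat/Data from occlist[1:] into occlist[0] and return it.

    overwrite=False: only keys missing from occlist[0] are added, so earlier
                     entries in the list take priority.
    overwrite=True:  later entries replace the values of earlier ones.
    """
    target = occlist[0]   # always the first occurrence that is changed
    for occ in occlist[1:]:
        for section in ('Info', 'Stat', 'Data'):
            store = target.setdefault(section, {})
            for key, value in occ.get(section, {}).items():
                if overwrite or key not in store:
                    store[key] = value
    return target


first = {'Info': {'Name': 'first'}, 'Stat': {}, 'Data': {}}
second = {'Info': {'Name': 'second', 'Desc': 'extra'}, 'Stat': {'Cons': 0.5}, 'Data': {}}
print(merge_occ([first, second]))
```

As in the description above, no check is made here that the occurrences should be merged at all.
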
SLiMList.motifAlignments(self,resfile='motifaln.fas')
    Makes motif alignments from occurrences. MotifOcc objects should have Sequence objects associated with them. If
    necessary, add a method to go through and generate Sequence objects using MotifOcc.info['FastaCmd'].
    >> resfile:str = Name of output file
SLiMList.motifAlnLong(self,Motif,seq_occ,append=False,memsaver=False,resfile='')
    Makes a single MotifAln output.
    >> Motif:Motif Object
    >> seq_occ:dictionary of {Seq:[MotifOcc]}
    >> append:bool [False] = whether to append file or create new
    >> memsaver:bool [False] = whether output is to be treated as a single sequence or all occurrences
    >> resfile:str [''] = name for motif alignment output file
SLiMList.motifInfo(self)
    Produces summary table for motifs, including expected values and information content.
SLiMList.motifNum(self)


SLiMList.motifOcc(self,byseq=False,justdata=None,fastacmd=False,nested=True,maxocc=0)
    Returns a list or dictionary of MotifOccurrences. Partially converts occurrences to old rje_motifocc style with
    Seq and Motif as part of occurrence.
    >> byseq:bool [False] = return a dictionary of {Sequence:{Motif:OccList}}, else {Motif:{Sequence:OccList}} 
    >> justdata:str [None] = if given a value, will return this data entry for each occ rather than the object itself
    >> fastacmd:bool [False] = whether to return FastaCmd instead of Sequence if Sequence missing
    >> nested:bool [True] = whether to return a nested dictionary or just the occurrences per Motif/Seq
    << returns dictionary or plain list of MotifOcc if byseq and bymotif are both False    
SLiMList.motifOut(self,filename='None',motlist=[])
    Outputs motifs in PRESTO format.
    >> filename:str [None] = Name for output file. Will not output if '' or 'None'.
    >> motlist:list of Motif objects to output. If [], will use self.list['Motifs']
SLiMList.motifs(self)


SLiMList.nameList(self)


SLiMList.outputs(self)
    Processes additional outputs for motif occurrences (Alignments and Features).
SLiMList.patternStats(self,log=False)
    Performs calculations for all motifs based on basic pattern (info['Sequence']), adding to Motif.stat/info/opt.
SLiMList.proteinAlignments(self,alndir='',hitname='AccNum')
    Generates copies of protein alignments, with motif hits marked.
SLiMList.rankMotifs(self,stat,cutoff=0)
    Reranks Motifs using stat and reduces to cutoff if given.
    >> stat:str = Stat to use for ranking
    >> cutoff:int [0] = number of top ranks to keep
SLiMList.reformatMiniMotif(self,mlines)
    Reformats MiniMotif file, compressing motifs as appropriate.
    >> mlines:list of lines read from input file
SLiMList.removeMotif(self,Motif,remtxt='')
    Removes motif and occurrences from self.
    >> Motif:Motif object to remove
    >> remtxt:str = Text to output to log
SLiMList.seqExp(self,seq=None)
    Populates self.dict[Expect].
    >> seq:Sequence object to consider
SLiMList.seqListExp(self,seqlist=None,filename='',cutoff=0)
    Populates self.dict[Expect].
    >> seqlist:SeqList object to consider
    >> filename:Sequence files to consider (must be fasta)
    >> cutoff:float [0] = Expectation cutoff, will remove motif if exceeded.
SLiMList.setupDomFilter(self)
    Sets up self.dict['DomFilter'].
SLiMList.setupFilters(self,slimheaders=[],occheaders=[])


SLiMList.slimNum(self)


SLiMList.slims(self)


SLiMList.statFilter(self,occdata={},statfilter={})
    Filters using rje_scoring and removes filtered MotifOcc.
    >> occdata:dictionary of {Occ:Statdict}
    >> statfilter:dictionary of statfilters
    << occdata: returns reduced occdata dictionary (same object - pre-reassign if original required!)
SLiMList.statFilterMotifs(self,statfilter)
    Filters motifs using statfilter.

rje_slimlist Module Methods

rje_slimlist.runMain()




rje_slimlist Module ToDo Wishlist

    # [ ] : Add defined ambiguities to automatically include

rje_svg [version 0.0] RJE SVG Module ~ [Top]

Module: rje_svg
Description: RJE SVG Module
Version: 0.0
Last Edit: 31/12/10
Imports: rje, rje_zen
Imported By: rje_ppi, rje_slimhtml, rje_tree
Copyright © 2009 Richard J. Edwards - See source code for GNU License Notice

Function:
The function of this module will be added here.

Commandline:
### ~~~ INPUT ~~~ ###
* col=LIST : Replace standard colour listing (mixed Hex and RGB) []

See also rje.py generic commandline options.

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje, rje_zen
Other modules needed: None

rje_svg Module Version History

    # 0.0 - Initial Compilation.

SVG Class

    SVG Class. Author: Rich Edwards (2010).

    Info:str
    
    Opt:boolean

    Stat:numeric

    List:list
    Col = Standard colour listing (mixed Hex and RGB)
    
    Dict:dictionary
    Col = Custom colour dictionaries. 

    Obj:RJE_Objects
SVG._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
SVG._setAttributes(self)
    Sets Attributes of Object.
SVG.col(self,i=0,type='Col',cycle=True,convert={})


SVG.cwCol(self,aa,consaa)
        if(sum(aa == c("X","x",".")) > 0){ return(soton$col[6]) }
        if(aa == "-"){ return("black") }
        if(aa == "G"){ return(rgbindex$ORANGE) }
        if(aa == "P"){ return(rgbindex$YELLOW) }
        if(aa == "T" & cwcontains(c("t","T","S","%","#"),consaa)){ return(rgbindex$GREEN) }
        if(aa == "S" & cwcontains(c("t","T","S","#"),consaa)){ return(rgbindex$GREEN) }
        if(aa == "N" & cwcontains(c("n","N","D"),consaa)){ return(rgbindex$GREEN) }
        if(aa == "Q" & cwcontains(c("q","Q","E","K","R","+"),consaa)){ return(rgbindex$GREEN) }
        if(sum(aa == c("W","L","V","I","M","F")) > 0 & cwcontains(c("%","#","A","C","F","H","I","L","M","V","W","Y","P","p"),consaa)){ return(rgbindex$BLUE) }
        if(aa == "A" & cwcontains(c("%","#","A","C","F","H","I","L","M","V","W","Y","P","p","T","S","s","G"),consaa)){ return(rgbindex$BLUE) }
        if(aa == "C" & cwcontains(c("%","#","A","S","F","H","I","L","M","V","W","Y","P","p"),consaa)){ return(rgbindex$BLUE) }
        if(aa == "C" & cwcontains(c("C"),consaa)){ return(rgbindex$PINK) }
        if(aa == "H" & cwcontains(c("%","#","A","C","F","H","I","L","M","V","W","Y","P","p"),consaa)){ return(rgbindex$CYAN) }
        if(aa == "Y" & cwcontains(c("%","#","A","C","F","H","I","L","M","V","W","Y","P","p"),consaa)){ return(rgbindex$CYAN) }
        if(aa == "E" & cwcontains(c("-","E","D","q","Q"),consaa)){ return(rgbindex$MAGENTA) }
        if(aa == "D" & cwcontains(c("-","E","D","n","N"),consaa)){ return(rgbindex$MAGENTA) }
        if(aa == "R" & cwcontains(c("+","K","R","Q"),consaa)){ return(rgbindex$RED) }
        if(aa == "K" & cwcontains(c("+","K","R","Q"),consaa)){ return(rgbindex$RED) }
        return("white")
        return(acol[[aa]])
    }
SVG.networkPlot(self,npos,G,nodecol={},font=16,width=1600,height=1200,ntype='ellipse',cutspace=True,xoffset=0,yoffset=0)
    Plot partial PPI network with given coordinates.
    >> npos:dict = Dictionary of {node:(x,y)}
    >> G:dict = Dictionary of {node:{node:col}} or {node:[nodes]}
    >> cutspace:bool = whether to cut unnecessary whitespace (crudely)
SVG.run(self)
    Main run method.
SVG.setup(self)
    Main class setup method.
SVG.setupCol(self,overwrite=False)
            for (aa in c("I","L","V","IFLMV","IFLM","IFMV","IFLV","FLMV","ILMV","IL","IV","ILV","ILF")){ acol[aa] = soton$col[19] }
        for (aa in c("A","M")){ acol[aa] = soton$col[19] }
        for (aa in c("AG","GS","AS","AGS")){ acol[aa] = soton$col[21] }
        for (aa in c("K","R","KR")){ acol[aa] = soton$col[3] }
        for (aa in c("D","E","DE")){ acol[aa] = soton$col[15] }
        for (aa in c("S","T","ST")){ acol[aa] = soton$col[14] }
        for (aa in c("C")){ acol[aa] = soton$col[9] }
        for (aa in c("P")){ acol[aa] = soton$col[17] }
        for (aa in c("F","Y","W","FY","FW","WY","FWY")){ acol[aa] = soton$col[18] }
        for (aa in c("G")){ acol[aa] = soton$col[5] }
        for (aa in c("H","HK","HR","HKR")){ acol[aa] = soton$col[2] }
        for (aa in c("HY","FH","FHY")){ acol[aa] = soton$col[1] }
        for (aa in c("Q","N")){ acol[aa] = soton$col[1] }
        for (aa in c("X","x",".")){ acol[aa] = soton$col[6] }
        for (aa in c("0","1","2","3","4","5","6","7","8","9","[","]",",","{","}")){ acol[aa] = "white" }
        for (aa in c("-")){ acol[aa] = "black" }
        for (aa in c("#","^","$","+")){ acol[aa] = soton$col[2] }

        ## ~ Revised CWColour paletter ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
        for (aa in c("I","L","V","IFLMV","IFLM","IFMV","IFLV","FLMV","ILMV","IL","IV","ILV","ILF")){ acol[aa] = cwcol("I",c("#")) }
        for (aa in c("A","M")){ acol[aa] = cwcol("I",c("#")) }
        for (aa in c("AG","GS","AS","AGS")){ acol[aa] = cwcol("A",c("G")) }
        for (aa in c("K","R","KR")){ acol[aa] = cwcol("K",c("+")) }
        for (aa in c("D","E","DE")){ acol[aa] = cwcol("D",c("-")) }
        for (aa in c("S","T","ST")){ acol[aa] = cwcol("S",c("S")) }
        for (aa in c("C")){ acol[aa] = cwcol("C",c("C")) }
        for (aa in c("P")){ acol[aa] = cwcol("P",c("P")) }
        for (aa in c("F","Y","W","FY","FW","WY","FWY")){ acol[aa] = cwcol(substr(aa,1,1),c("#")) }
        for (aa in c("G")){ acol[aa] = cwcol("G",c("G")) }
        for (aa in c("H","HK","HR","HKR")){ acol[aa] = cwcol("H",c("H")) }
        for (aa in c("HY","FH","FHY")){ acol[aa] = cwcol("F",c("#")) }
        for (aa in c("Q","N")){ acol[aa] = cwcol(aa,c(aa)) }

        #ppcol = c("black",soton$col[1:11],soton$col[13:16],soton$col[18:21])
        ppcol = c("black",soton$col[1:11],soton$col[13:21])



        ## ~ ClustalW Alignment colours ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ##
        # color lookup table - this is optional, if no rgbindex is specified, 8
        # hardcoded colors will be used.
        # A maximum of 16 colors can be specified - any more will be ignored!
        # @rgbindex
        # RED 0.9 0.2 0.1
        # BLUE 0.1 0.5 0.9
        # GREEN 0.1 0.8 0.1
        # CYAN 0.0 0.7 0.7
        # PINK 0.9 0.5 0.5
        # MAGENTA 0.8 0.3 0.8
        # YELLOW 0.8 0.8 0.0
        # ORANGE 0.9 0.6 0.3
        rgbindex = list()
        rgbindex["RED"] = rgb(0.9, 0.2, 0.1)
        rgbindex["BLUE"] = rgb(0.1, 0.5, 0.9)
        rgbindex["GREEN"] = rgb(0.1, 0.8, 0.1)
        rgbindex["CYAN"] = rgb(0.0, 0.7, 0.7)
        rgbindex["PINK"] = rgb(0.9, 0.5, 0.5)
        rgbindex["MAGENTA"] = rgb(0.8, 0.3, 0.8)
        rgbindex["YELLOW"] = rgb(0.8, 0.8, 0.0)
        rgbindex["ORANGE"] = rgb(0.9, 0.6, 0.3)

        # :: @consensus
        cwcons = function(aalist){
        acons = c()
        for(aa in c("A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y")){
            ax = sum(aalist == aa)
            acut = 0.85 * length(aalist)
            if(ax >= acut){ acons = c(acons,aa); }
        }
        nonpol = 0
        for(aa in c("A","C","F","H","I","L","M","P","V","W","Y")){
            nonpol = nonpol + sum(aalist == aa)
        }
        if(nonpol > 0.6 * length(aalist)){ acons = c(acons,"%"); }
        if(nonpol > 0.8 * length(aalist)){ acons = c(acons,"#"); }
        if(sum(aalist == "E" | aalist == "D") >= (0.5 * length(aalist))) { acons = c(acons,"-"); }
        if(sum(aalist == "K" | aalist == "R") >= (0.6 * length(aalist))) { acons = c(acons,"+"); }
        if(sum(aalist == "G") >= (0.5 * length(aalist))) { acons = c(acons,"g"); }
        if(sum(aalist == "N") >= (0.5 * length(aalist))) { acons = c(acons,"n"); }
        if(sum(aalist == "Q" | aalist == "E") >= (0.5 * length(aalist))) { acons = c(acons,"q"); }
        if(sum(aalist == "P") >= (0.5 * length(aalist))) { acons = c(acons,"p"); }
        if(sum(aalist == "S" | aalist == "T") >= (0.5 * length(aalist))) { acons = c(acons,"s","t"); }
        return(acons)
        }
        # :: % = 60% w:l:v:i:m:a:f:c:y:h:p
        # :: # = 80% w:l:v:i:m:a:f:c:y:h:p
        # :: - = 50% e:d
        # :: + = 60% k:r
        # :: g = 50% g
        # :: n = 50% n
        # :: q = 50% q:e
        # :: p = 50% p
        # :: t = 50% t:s
        # :: s = 50% t:s
        # :: A = 85% a
        # :: C = 85% c
        # :: D = 85% d
        # :: E = 85% e
        # :: F = 85% f
        # :: G = 85% g
        # :: H = 85% h
        # :: I = 85% i
        # :: K = 85% k
        # :: L = 85% l
        # :: M = 85% m
        # :: N = 85% n
        # :: P = 85% p
        # :: Q = 85% q
        # :: R = 85% r
        # :: S = 85% s
        # :: T = 85% t
        # :: V = 85% v
        # :: W = 85% w
        # :: Y = 85% y

        cwcontains = function(wanted,observed){
        for(type in wanted){
            if(sum(observed == type) > 0){ return(TRUE) }
        }
        return(FALSE)
        }

        # :: @color
        # :: g = ORANGE
        # :: p = YELLOW
        # :: t = GREEN if t:S:T:%:#
        # :: s = GREEN if t:S:T:#
        # :: n = GREEN if n:N:D
        # :: q = GREEN if q:Q:E:+:K:R
        # :: w = BLUE if %:#:A:C:F:H:I:L:M:V:W:Y:P:p
        # :: l = BLUE if %:#:A:C:F:H:I:L:M:V:W:Y:P:p
        # :: v = BLUE if %:#:A:C:F:H:I:L:M:V:W:Y:P:p
        # :: i = BLUE if %:#:A:C:F:H:I:L:M:V:W:Y:P:p
        # :: m = BLUE if %:#:A:C:F:H:I:L:M:V:W:Y:P:p
        # :: a = BLUE if %:#:A:C:F:H:I:L:M:V:W:Y:P:p:T:S:s:G
        # :: f = BLUE if %:#:A:C:F:H:I:L:M:V:W:Y:P:p
        # :: c = BLUE if %:#:A:F:H:I:L:M:V:W:Y:S:P:p
        # :: c = PINK if C
        # :: h = CYAN if %:#:A:C:F:H:I:L:M:V:W:Y:P:p
        # :: y = CYAN if %:#:A:C:F:H:I:L:M:V:W:Y:P:p
        # :: e = MAGENTA if -:D:E:q:Q
        # :: d = MAGENTA if -:D:E:n:N
        # :: k = RED if +:K:R:Q
        # :: r = RED if +:K:R:Q
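
The R-style cwcons function quoted in the SVG.setupCol documentation above could be re-expressed in Python roughly as follows. This is a direct port of the quoted thresholds for illustration, not code taken from rje_svg; the names cw_consensus, AMINO_ACIDS and NONPOLAR are hypothetical. Note that, as in the quoted R code, the non-polar classes use a strict ">" comparison while the other classes use ">=".

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NONPOLAR = "ACFHILMPVWY"


def cw_consensus(column):
    """Return the consensus symbols for one alignment column (string or list)."""
    n = len(column)
    # Single residues conserved in at least 85% of sequences
    cons = [aa for aa in AMINO_ACIDS if column.count(aa) >= 0.85 * n]
    # Non-polar classes: % above 60%, # above 80%
    nonpol = sum(column.count(aa) for aa in NONPOLAR)
    if nonpol > 0.6 * n:
        cons.append('%')
    if nonpol > 0.8 * n:
        cons.append('#')
    # Remaining residue classes, in the order used by the quoted R code
    groups = [('-', 'ED', 0.5), ('+', 'KR', 0.6), ('g', 'G', 0.5),
              ('n', 'N', 0.5), ('q', 'QE', 0.5), ('p', 'P', 0.5)]
    for symbol, residues, cutoff in groups:
        if sum(column.count(aa) for aa in residues) >= cutoff * n:
            cons.append(symbol)
    if sum(column.count(aa) for aa in 'ST') >= 0.5 * n:
        cons.extend(['s', 't'])
    return cons


print(cw_consensus('LLLLLLLLLL'))
```
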
SVG.svgAlignment(self)
    .
SVG.svgFile(self,svgtext,filename='',width='100%',height='100%')
    Returns SVG code wrapped in header and footer. Saves if filename given.
    >> svgtext:str = Main body of SVG file
    >> filename:str [''] = Filename to save to. If no filename given, will not save.
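
A wrapper of the kind svgFile describes might be sketched as below. The exact header and footer text written by rje_svg is not shown in this documentation, so the XML declaration and svg element used here are an assumed minimal choice, and svg_file is a hypothetical name.

```python
def svg_file(svgtext, filename='', width='100%', height='100%'):
    """Wrap svgtext in a minimal SVG header and footer; save if filename given.

    The header below is an assumed minimal form, not the rje_svg header.
    """
    svg = '\n'.join([
        '<?xml version="1.0" standalone="no"?>',
        '<svg width="%s" height="%s" version="1.1" '
        'xmlns="http://www.w3.org/2000/svg">' % (width, height),
        svgtext,
        '</svg>',
    ])
    if filename:   # no filename given: return the text without saving
        with open(filename, 'w') as handle:
            handle.write(svg)
    return svg


print(svg_file('<circle cx="50" cy="50" r="40" fill="white"/>', width=100, height=100))
```
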
SVG.svgHTML(self,svglink,title,svgfile=None,height=1600,width=1600)
    Returns HTML Code for SVG file embedding.
SVG.svgTree(self,basefile='',data={},treesplit=0.5,font=12,maxfont=20,width=1600,height=0,save=True,xoffset=0,yoffset=0,internal_labels='boot')
    Generate SVG Tree. Based on rje_ppi.r

rje_svg Module Methods

rje_svg.runMain()




rje_svg Module ToDo Wishlist

    # [ ] : List here

rje_tm [version 1.2] Transmembrane and Signal Peptide Prediction Module ~ (rje_tm.py)[Top]

Module: rje_tm
Description: Transmembrane and Signal Peptide Prediction Module
Version: 1.2
Last Edit: 16/08/07
Imports: rje, rje_seq, rje_zen
Imported By: unifake, compass, rje_ensembl
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
Will read in results from tmhmm and/or signalp files as appropriate and append output to:
- tm.tdt = TM domain counts and orientation
- domains.tdt = Domain table
- signalp.tdt = SignalP data (use to add signal peptide domains to domains table using mySQL)

Commandline:
* tmhmm=FILE : TMHMM output file [None]
* signalp=FILE: SignalP output file [None]
* mysql=T/F : Output results in tdt files for mySQL import [True]

* seqin=FILE : Sequence file for which predictions have been made [None]
* maskcleave=T/F : Whether to output sequences with cleaved signal peptides masked. [False]
* source=X : Source text for mySQL file ['tmhmm']

Uses general modules: os, string, sys, threading, time
Uses RJE modules: rje, rje_seq
Other modules required: rje_blast, rje_dismatrix, rje_pam, rje_sequence, rje_uniprot

rje_tm Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Basic functional version.
    # 1.1 - Output of 'truncated' proteins - cleaved signalp peptides replaced with Xs.
    # 1.2 - Added standalone parsing methods as first step in development (& for rje_ensembl.ensDat())

TM Class

    TM Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Sequence file for which predictions have been made [None]
    - TMHMM = File for TMHMM output
    - SignalP = File for SignalP output
    - Source = Source text for mySQL file ['tmhmm']
    
    Opt:boolean
    - MySQL = Whether to output files to mySQL tables (*.tdt)
    - MaskCleave = Whether to output sequences with cleaved signal peptides masked. [False]

    Stat:numeric

    Obj:RJE_Objects
    - SeqList = rje_seq.SeqList object.
TM._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
TM._setAttributes(self)
    Sets Attributes of Object:
    - Info:str ['Name','TMHMM','SignalP','Source']
    - Stats:float []
    - Opt:boolean ['MySQL','MaskCleave']
    - Obj:RJE_Object []
TM.getDomains(self,acc)
    Returns domains for a given sequence = list of dictionaries ['Type','Start','End']
    >> file:str = will read from file if given, else self.info['TMHMM']
TM.maskCleave(self)
    Outputs masked cleavage sequences to file.
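
The masking step (cleaved signal peptides replaced with Xs, as noted in the version history) can be illustrated with a short sketch. The function name and the convention that cleave_pos is the number of N-terminal residues removed by cleavage are assumptions; the real method works on SeqList objects and writes to file.

```python
def mask_signal_peptide(sequence, cleave_pos):
    """Replace the first cleave_pos residues with 'X', keeping the length unchanged."""
    cleave_pos = max(0, min(cleave_pos, len(sequence)))   # clamp to the sequence
    return 'X' * cleave_pos + sequence[cleave_pos:]


print(mask_signal_peptide('MKTLLVAGS', 4))
```
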
TM.mySQLOut(self,tmfile='tm.tdt',domfile='domains.tdt',sigfile='signalp.tdt',makenew=False)
    Output to tdt files.
    >> tmfile:str = File to save TM numbers in (no save if None)
    >> domfile:str = File to save domain data in (no save if None)
    >> sigfile:str = File to save choice SignalP data in (no save if None)
    >> makenew:boolean [False] = whether to make new files (True) or append (False)
TM.parseSignalP(self,file=None)
    Parses SignalP into dictionary self.signalp. This takes heavily from Paul's SignalP module. SignalP should be
    run using the "-f short" option to give simple one-line predictions per protein.
    >> file:str = will read from file if given, else self.info['SignalP']
TM.parseTMHMM(self,file=None)
    Parses TMHMM into dictionary self.tmhmm.
    >> file:str = will read from file if given, else self.info['TMHMM']

rje_tm Module Methods

rje_tm.domainList(tmdict)
    Returns list of TMHMM domain dictionaries {Type/Start/End} from Topology string.
rje_tm.parseTMHMM(tmline)
    Returns a dictionary of TMHMM data from a TMHMM line.
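
As a rough illustration of the two module-level methods above, the sketch below parses one line of TMHMM short-format output and expands its Topology string into domain dictionaries with Type/Start/End keys. It assumes the tab-delimited one-line-per-protein layout (ID, then key=value fields such as len, PredHel and Topology). The function names and the 'INSIDE'/'OUTSIDE'/'TM' type labels are hypothetical and may differ from those used by rje_tm.

```python
import re


def parse_tmhmm_line(line):
    """Split a tab-delimited TMHMM short-format line into a dictionary."""
    fields = line.strip().split("\t")
    data = {"Seq": fields[0]}  # 'Seq' is an illustrative key name
    for field in fields[1:]:
        key, _, value = field.partition("=")
        data[key] = value
    return data


def topology_domains(topology, seqlen):
    """Turn a topology string such as 'i7-29o44-66i' into domain dictionaries."""
    labels = {"i": "INSIDE", "o": "OUTSIDE"}  # assumed type labels
    domains = []
    side, pos = None, 1
    for letter, start, end in re.findall(r"([io])|(\d+)-(\d+)", topology):
        if letter:
            side = letter  # loop region that follows is on this side
            continue
        start, end = int(start), int(end)
        if side and start > pos:
            domains.append({"Type": labels[side], "Start": pos, "End": start - 1})
        domains.append({"Type": "TM", "Start": start, "End": end})
        pos = end + 1
    if side and pos <= seqlen:
        domains.append({"Type": labels[side], "Start": pos, "End": seqlen})
    return domains


line = "prot1\tlen=80\tExpAA=45.10\tFirst60=22.50\tPredHel=2\tTopology=i7-29o44-66i"
tm = parse_tmhmm_line(line)
for dom in topology_domains(tm["Topology"], int(tm["len"])):
    print(dom)
```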
rje_tm.runMain()




rje_tm Module ToDo Wishlist

    # [ ] : Commandline option for running TMHMM and SignalP
    # [ ] : Forking for TMHMM and SignalP -> split input, fork, compile, delete
    # [ ] : Remodel entire TM class to be in line with new module structure

rje_tree [version 2.9] Phylogenetic Tree Module ~ [Top]

Module: rje_tree
Description: Phylogenetic Tree Module
Version: 2.9
Last Edit: 07/12/11
Imports: rje, rje_ancseq, rje_html, rje_seq, rje_svg, rje_tree_group
Imported By: badasp, budapest, fiesta, gablam, gasp, gfessa, gopher_V2, happi, haqesac, pingu, ned_SLiMPrints, ned_SLiMPrints_Tester, ned_conservationScorer, rje_slimhtml, slimjim, slimprints
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
Reads in, edits and outputs phylogenetic trees. Executes duplication and subfamily determination. More details available
in documentation for HAQESAC, GASP and BADASP at http://www.bioinformatics.rcsi.ie/~redwards/

General Commands:
* nsfin=FILE : load NSF tree from FILE
* phbin=FILE : load ClustalW Format *.phb NSF tree from FILE
* seqin=FILE : load sequence list from FILE (not compatible with useanc)
* disin=FILE : load distance matrix from FILE (Phylip format for use with distance matrix methods) [None]
* useanc=FILE : load sequences from ancestral sequence FILE (not compatible with seqin)
* deflen=X : Default length for branches when no lengths given [0.1] (or 0.1 x longest branch)
*Note that in the case of conflicts (e.g. seqin=FILE1 useanc=FILE2), the latter will be used.*
* autoload=T/F : Whether to automatically load sequences upon initiating object [True]
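
The option=value style of the commands above, and the rule that the latter of two conflicting options is used, can be sketched as follows. This is a minimal stand-alone illustration, not the rje command-line parser: parse_commands and sequence_source are hypothetical helper names, and the sketch relies on Python 3.7+ dictionaries preserving insertion order.

```python
def parse_commands(cmd_list):
    """Collect option=value arguments into a dictionary; later values win."""
    options = {}
    for cmd in cmd_list:
        key, sep, value = cmd.partition("=")
        if not sep:
            continue  # ignore anything that is not option=value
        options.pop(key, None)  # re-insert so order reflects the last occurrence
        options[key] = value
    return options


def sequence_source(options):
    """Return (option, file) for whichever of seqin/useanc was given last."""
    chosen = None
    for key, value in options.items():
        if key in ("seqin", "useanc"):
            chosen = (key, value)
    return chosen


opts = parse_commands(["seqin=FILE1", "root=mid", "useanc=FILE2", "root=ran"])
print(opts["root"], sequence_source(opts))
```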

Rooting Commands:
* root=X : Rooting of tree (rje_tree.py):
- mid = midpoint root tree.
- ran = random branch.
- ranwt = random branch, weighted by branch lengths.
- man = always ask for rooting options (unless i<0).
- none = unrooted tree
- FILE = with seqs in FILE as outgroup. (Any option other than above)
* rootbuffer=X : Min. distance from node for root placement (percentage of branch length)[0.1]

Grouping/Subfamily Commands:
* bootcut=X : cut-off percentage of tree bootstraps for grouping.
* mfs=X : minimum family size [3]
* fam=X : minimum number of families (If 0, no subfam grouping) [0]
* orphan=T/F : Whether orphan sequences (not in subfam) are allowed. [True]
* allowvar=T/F: Allow variants of same species within a group. [False]
* qryvar=T/F : Allow variants of query species within a group (over-rides allowvar=F). [False]
* groupspec=X : Species for duplication grouping [None]
* specdup=X : Minimum number of different species in clade to be identified as a duplication [1]
* group=X : Grouping of tree
- man = manual grouping (unless i<0).
- dup = duplication (all species unless groupspec specified).
- qry = duplication with species of Query sequence (or Sequence 1) of treeseq
- one = all sequences in one group
- None = no group (case sensitive)
- FILE = load groups from file

Tree Making Commands:
* cwtree=FILE : Make a ClustalW NJ Tree from FILE (will save *.ph or *.phb) [None]
* kimura=T/F : Whether to use Kimura correction for multiple hits [True]
* bootstraps=X : Number of bootstraps [0]
* clustalw=CMD : Path to (and including) the CLUSTALW program ['c:/bioware/clustalw.exe'] * Use forward slashes (/)
* fasttree=PATH : Path to (and including) the FastTree program [./FastTree]
* phylip=PATH : Path to PHYLIP programs ['c:/bioware/phylip3.65/exe/'] * Use forward slashes (/)
* phyoptions=FILE : File containing extra Phylip tree-making options ('batch running') to use [None]
* protdist=FILE : File containing extra Phylip PROTDIST options ('batch running') to use [None]
* maketree=X : Program for making tree [None]
- None = Do not make tree from sequences
- clustalw = ClustalW NJ method
- neighbor = PHYLIP NJ method
- upgma = PHYLIP UPGMA (neighbor) method
- fitch = PHYLIP Fitch method
- kitsch = PHYLIP Kitsch (clock) method
- protpars = PHYLIP MP method
- proml = PHYLIP ML method
- fasttree = Use FastTree
- PATH = Alternatively, a path to a different tree program/script can be given. This should accept ClustalW parameters.

Tree Display/Saving Commands
* savetree=FILE : Save a generated tree as FILE [seqin.maketree.nsf]
* savetype=X : Format for generated tree file (nsf/nwk/text/r/png/bud/qspec/cairo/te/svg/html) [nsf]
* treeformats=LIST: List of output formats for generated trees [nsf]
* outnames=X : 'short'/'long' names in output file [short]
* truncnames=X : Truncate names to X characters (0 for no truncation) [123]
* branchlen=T/F : Whether to use branch lengths in output tree [True]
* deflen=X : Default branch length (when none given, also for tree scaling) [0.1]
* textscale=X : Default scale for text trees (no. of characters per deflen distance) [4]
* seqnum=T/F : Output sequence numbers (if making tree from sequences) [True]

Classes:
Tree(rje.RJE_Object):
- Phylogenetic Tree class.
Node(rje.RJE_Object):
- Individual nodes (internal and leaves) for Tree object.
Branch(rje.RJE_Object):
- Individual branches for Tree object.

Uses general modules: copy, os, random, re, string, sys, time
Uses RJE modules: rje, rje_ancseq, rje_seq, rje_tree_group
Other modules needed: rje_blast, rje_dismatrix, rje_pam, rje_sequence, rje_uniprot

rje_tree Module Version History

    # 0.0 - Initial Compilation.
    # 0.1 - Changed CladeList to lists inside tuple
    # 0.2 - Completely reworked with a much greater OO focus
    # 0.3 - No Out Object in Objects
    # 1.0 - Better
    # 1.1 - Added tree-drawing with ClustalW
    # 1.2 - Bug fixing.
    # 1.3 - Added separate maketree=PATH command to enable replacement of ClustalW tree drawing
    # 1.4 - Added toggling of full sequence description in TextTree/EditTree
    # 1.5 - Added ability to read in integer branch lengths
    # 1.6 - Added PHYLIP tree making
    # 1.7 - Modified text/log output. Added commandline scale option for text trees.
    # 1.8 - Updating reporting of redundant duplication finding.
    # 1.9 - Modified mapSeq() method to be more robust to different formats
    # 1.10- Fixed some bugs and had a minor tidy.
    # 2.0 - Updated code to be more inline with newer RJE modules. Fixed some bugs.
    # 2.1 - Added tree savetype
    # 2.2 - Added treeformats=LIST option to eventually replace savetype. Added te (TreeExplorer) format.
    # 2.3 - Added specdup=X   : Minimum number of different species in clade to be identified as a duplication [1]
    # 2.4 - Added fasttree generation of trees.
    # 2.5 - Added qryvar=T/F  : Allow variants of query species within a group (over-rides allowvar=F). [False]
    # 2.6 - Added PNG R variants.
    # 2.7 - Added Improved NSF Tree reading.
    # 2.8 - Added SVG and HTML Tree output.
    # 2.9 - Added NWK output (=NSF output with different extension for MEGA!)

Branch Class

    Individual branches for Tree object. Author: Rich Edwards (2005).

    Info:str
    - Name = Name of Branch
    
    Stat:numeric
    - Bootstrap = Bootstrap Support (-1 = none)
    - Length = Branch Length (-1 = none)
    - PAM = Branch PAM (-1 = none)
    
    Obj:RJE_Objects
    - Sequence = rje_seq.Sequence object

    Other:
    - node = list of two nodes connected to branch
Branch._setAttributes(self)
    Sets Attributes of Object.
Branch.ancNode(self)
    Returns 'ancestral' node.
Branch.combine(self,node,branch)
    Combines data from another branch, usually during unrooting.
    Will warn if bootstraps are not compatible and die if node is incorrect.
    >> node:Node object = common node between branches
    >> branch:Branch object = other branch
Branch.commonNode(self,branch)
    Returns common node with other branch.
    >> branch:Branch Object
    << node:Node Object or None if not common.
Branch.descNode(self)
    Returns 'descendant' node.
Branch.link(self,node)
    Returns other end of branch.
    >> node:Node object
    << link:Node object
Branch.show(self)
    Returns Node numbers X -> Y.
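
The relationship between branches and nodes described above (each branch holds two nodes, each node holds a list of its branches) can be sketched with two minimal classes. These are simplified stand-ins written for illustration, not the rje_tree classes. The attribute names node and branch follow the documentation, while common_node and neighbours are illustrative equivalents of Branch.commonNode and Node.neighbours.

```python
class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.branch = []  # branches attached to this node


class Branch:
    def __init__(self, anc, desc, length=-1):
        self.node = [anc, desc]  # the two nodes connected by this branch
        self.length = length
        anc.branch.append(self)
        desc.branch.append(self)

    def link(self, node):
        """Return the node at the other end of the branch."""
        if node is self.node[0]:
            return self.node[1]
        if node is self.node[1]:
            return self.node[0]
        raise ValueError("node is not attached to this branch")

    def common_node(self, other):
        """Return the node shared with another branch, or None."""
        for node in self.node:
            if node in other.node:
                return node
        return None


def neighbours(node, ignore=()):
    """Nodes one branch away from node, skipping any listed in ignore."""
    linked = [branch.link(node) for branch in node.branch]
    return [n for n in linked if n not in ignore]


a, b, c = Node(1), Node(2), Node(3)
ab, bc = Branch(a, b), Branch(b, c)
print(ab.link(a).id, ab.common_node(bc).id, [n.id for n in neighbours(b)])
```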

Node Class

    Individual nodes (internal and leaves) for Tree object. Author: Rich Edwards (2005).

    Info:str
    - Name = Name of Node
    - CladeName = Name to be used if describing descendant clade
    - Type = Type of node = Internal/Terminal/Root/(Duplication)
    
    Opt:boolean
    - Duplication = Whether the node is a duplication node
    - Compress = Whether to compress clade in Tree.textree()
    - SpecDup = Whether node is a species-specific(ish) duplication

    Stat:numeric
    - ID = node number

    Obj:RJE_Objects
    - Sequence = rje_seq.Sequence object

    Other:
    - branch = list of branches (1 for terminal, 2 for root, 3 for internal)
Node._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Node._setAttributes(self)
    Sets Attributes of Object.
Node.ancBranch(self)
    Returns branch that links node with 'ancestor' or None if none.
    << branch:Branch Object
Node.ancNode(self)
    Returns linked node that is 'ancestral' or None if none
    << node:Node Object
Node.flipDesc(self)
    Flips 'Descendant' Nodes by swapping Branches in self.branch.
Node.link(self,othernode)
    Returns branch that links node with self or None if none.
    >> othernode:Node Object to link
    << link:Branch Object
Node.mapSeq(self,seq=None,id=0)
    Maps a rje_seq.Sequence object onto Node.
    >> seq:rje_seq.Sequence object
    >> id:int = order of sequence in SeqList (for output clarity only)
Node.neighbours(self,ignore=[])
    Returns list of Node objects linked by a single branch.
    >> ignore:list of Node objects
    << neighbours:list of Node Objects
Node.rename(self,rooting=None)
    Gives internal nodes a good name based on numbering.
    >> rooting:str = method of rooting if to be added.
Node.setType(self)
    Sets the node's type based on link and duplication.
Node.shortName(self)
    Returns short name = first word of name.

Tree Class

    Phylogenetic Tree class. Author: Rich Edwards (2005).

    Info:str
    - Name = Name of tree (usually filename)
    - Type = Method of construction (if known)
    - Rooting = Rooting strategy [man]
    - RootMethod = Method used to determine root
    - GroupSpecies = Species to be used for Grouping
    - Grouping = Method used to determine groups
    - ClustalW = Path to ClustalW
    - Phylip = Path to PHYLIP programs ['c:/bioware/phylip3.65/exe/'] * Use forward slashes (/)
    - PhyOptions = File containing extra Phylip options ('batch running') to use [None]
    - ProtDist = File containing extra Phylip PROTDIST ('batch running') to use [None]
    - MakeTree = Tree drawing program (ClustalW options)
    - CWTree = Whether to use ClustalW NJ to make tree in self.makeTree()
    - FastTree = Path to (and including) the FastTree program [./FastTree]
    - DisIn = load distance matrix from FILE (for use with distance matrix methods) [None]
    - SaveTree = Save a generated tree as FILE [None]
    - SaveType = Format for generated tree file (nsf/text/r/png/bud/cairo) [nsf]
    - OutNames = 'short'/'long' names in output file [short]
               
    Opt:boolean
    - Rooted = Whether tree is rooted
    - ReRooted = Whether tree has had its root altered
    - Branchlengths = Whether tree has branchlengths
    - Bootstrapped = Whether tree has bootstraps
    - QueryGroup = Group using specified Query Species
    - Orphans = Whether 'orphan' sequences allowed
    - AllowVar = Allow variants of same species within a group. [False]
    - QryVar = Allow variants of query species within a group (over-rides allowvar=F). [False]
    - Kimura = Whether to use Kimura multiple hit correction
    - OutputBranchLen = Whether to use branch lengths in output tree [True]
    - AutoLoad = Whether to automatically load sequences upon initiating object [True]
    - SeqNum = Whether to output sequence numbers (if making tree from sequences) [True]

    Stat:numeric
    - DefLen = Default length for branches when no lengths given [0.1]
    - TextScale = Default scale for text trees (no. of characters per deflen distance) [4]
    - RootBuffer = Min. distance from node for root placement (percentage of branch length)[0.1]
    - SeqNum = Number of seqs (termini)
    - Bootstraps = Number of bootstraps (if known)
    - BootCut = cut-off percentage of tree bootstraps for grouping.
    - MinFamSize = minimum family size [2]
    - MinFamNum = Minimum number of families (If 0, no subfam grouping) [0]
    - SpecDup = Minimum number of different species in clade to be identified as a duplication [1]
    - TruncNames = Truncate names to X characters (0 for no truncation) [123]

    List:list
    - TreeFormats = List of output formats for generated trees [nsf]

    Dict:dictionary    

    Obj:RJE_Objects
    - SeqList = rje_seq.SeqList object
    - PAM = rje_pam.PamCtrl object

    Other:
    - node = List of Node Objects
    - branch = List of Branch Objects
    - subfam = List of Node Objects that specify subgroup clades

    Additional Commands:
    - nsfin=FILE = load NSF tree from FILE
Tree._addGroup(self,node)
    See rje_tree_group._addGroup().
Tree._autoGroups(self,method='man')
    See rje_tree_group._autoGroups().
Tree._bestCladeNode(self,nodelist)
    Returns the best ancestral node for the nodelist clade given.
    - include all seqs in clade but minimise others.
    >> nodelist:list of node objects
    << cladenode:Node object
Tree._changeRoot(self)
    Gives options to Re-root (or unroot) Tree.
    Returns True if root changes or False if no change.
    Should call treeRoot() from other modules - will loop this method as appropriate.
Tree._checkGroupNames(self)


Tree._checkGroups(self)
    See rje_tree_group._checkGroups().
Tree._checkTree(self)
    Checks integrity of tree
    - that nodes and branches match
    - nodes have correct number of branches (1-3)
    - branches have correct number of nodes (2)
    << True if OK. False if Bad.
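
The integrity rules listed for _checkTree can be expressed compactly on a simplified representation. The sketch below uses plain node identifiers and (node, node) tuples instead of Node and Branch objects. The function name check_tree and the extra rule that a branch may not join a node to itself are assumptions added for illustration.

```python
def check_tree(node_ids, branches):
    """Return True if every branch joins two known nodes and every node
    has between one and three branches, else False."""
    node_ids = set(node_ids)
    degree = {node: 0 for node in node_ids}
    for branch in branches:
        # Each branch must connect exactly two (different) nodes.
        if len(branch) != 2 or branch[0] == branch[1]:
            return False
        for node in branch:
            if node not in node_ids:
                return False  # branch refers to a node missing from the tree
            degree[node] += 1
    # 1 branch = terminal, 2 = root, 3 = internal node.
    return all(1 <= count <= 3 for count in degree.values())


print(check_tree([1, 2, 3, 4], [(1, 4), (2, 4), (3, 4)]))
```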
Tree._clearGroups(self)
    See rje_tree_group._clearGroups().
Tree._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Tree._cutCmdList(self,cmds=None)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Tree._descClades(self,node,internal=False)
    Returns tuple of lists of both descendant clades.
    >> node:Node Object = ancestral node
    >> internal:bool [False] = whether to return internal nodes as well as termini
    << tuple:lists of node objects in descendant clades
Tree._dupGroup(self)
    See rje_tree_group._dupGroup().
Tree._getNode(self,id)
    Returns Node Object with given ID.
    >> id:int = node.stat['ID']
    << node:Node Object
Tree._getRootNode(self)
    Returns root node or None if not rooted.
Tree._groupChoice(self)
    See rje_tree_group._groupChoice().
Tree._groupDisSum(self,seq1,seq2,text='')
    See rje_tree_group._groupDisSum().
Tree._groupRules(self)
    See rje_tree_group._groupRules().
Tree._grpSeqSort(self,seqs=[],compseq=None)
    See rje_tree_group._grpSeqSort().
Tree._grpVarDel(self,delseq=[],kept=None,fam=None)
    See rje_tree_group._grpVarDel().
Tree._isAnc(self,an,dn)
    Returns boolean whether an is ancestor of dn.
    >> an:Node = putative ancestral node 
    >> dn:Node = putative descendant node
    << anc:boolean = whether an is ancestral of dn
Tree._loadGroups(self,filename='rje_tree.grp')
    See rje_tree_group._loadGroups().
Tree._makeNSFTree(self,seqnum=False,seqname='short',maxnamelen=123,blen='Length',bootstraps='boot',multiline=True,te=True)
    Generates an NSF Tree from tree data.
    >> seqnum:boolean [False] = whether to print sequence numbers
    >> seqname:str ['short'] = name to use for sequences
    - 'num' = numbers only; 'short' = short sequence names; 'long' = long names;  
    >> maxnamelen:int [123] = truncate names to this length for compatibility with other programs.
    >> blen:str ['Length'] = stat to use for branch lengths 
    - 'none' = do not use branch lengths; 'fix' = fix at _deflen;
    - 'pam' = replace with PAM distances if calculated (else none)
    - other = use node.stat[blen]
    >> bootstraps:str ['boot'] = what to print for bootstraps
    - 'none' = nothing; 'boot' = bootstraps if given (else none); 'node' = node numbers
    >> multiline:boolean [True] = whether to spread file over multiple lines (1 per node)
    >> te:boolean [True] = make TreeExplorer compatible
    - whitespaces are replaced with underscores, brackets with '' and colons/semicolons with -
    << outtree:str = NSF tree
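
The name clean-up described for TreeExplorer-compatible output (whitespace to underscores, brackets removed, colons/semicolons to '-') together with maxnamelen truncation can be sketched as below. This is an illustration, not the module's code: the function name tree_safe_name is hypothetical, runs of whitespace are assumed to collapse to a single underscore, and both round and square brackets are assumed to count as brackets.

```python
import re


def tree_safe_name(name, maxnamelen=123):
    """Make a sequence name safe for a Newick (NSF) tree string."""
    name = re.sub(r"\s+", "_", name.strip())  # whitespace -> underscore
    name = re.sub(r"[()\[\]]", "", name)      # brackets -> ''
    name = re.sub(r"[:;]", "-", name)         # colons/semicolons -> '-'
    if maxnamelen > 0:                        # 0 means no truncation
        name = name[:maxnamelen]
    return name


print(tree_safe_name("HUMAN protein (fragment): isoform 2;"))
```

These characters are replaced because brackets, colons, commas and semicolons are structural symbols in Newick strings.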
Tree._maxBoot(self,b,maxb)
    Returns predicted number of bootstraps assuming a power of ten.
    >> b:int = branch bootstrap
    >> maxb:int = current max
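
One plausible reading of "assuming a power of ten" is that the total number of bootstrap replicates is taken to be 10, 100, 1000 and so on, and is raised whenever a branch value exceeds the current estimate. The sketch below follows that reading only. The name max_boot and the handling of estimates below 1 are assumptions, and the real method may behave differently.

```python
def max_boot(b, maxb):
    """Return maxb, raised by factors of ten until it is at least b."""
    if maxb < 1:
        maxb = 1  # assumed starting point when no estimate exists yet
    while b > maxb:
        maxb *= 10
    return maxb


# Scan the bootstrap values read from a tree to estimate the replicate count.
estimate = 1
for boot in [45, 100, 870]:
    estimate = max_boot(boot, estimate)
print(estimate)
```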
Tree._nodeClade(self,node,internal=False)
    Returns list of descendant clade.
Tree._nodeSeqs(self,nodelist)
    Returns list of descendant clade.
Tree._nodesNumbered(self,check1toN=False)
    Whether sequence names are purely numbers.
    >> check1toN:boolean = whether to also check that the numbers are 1 to N [False]
    << Returns True or False
Tree._orphanCount(self)
    See rje_tree_group._orphanCount().
Tree._prune(self,branch=None,remtext='Manual Tree Pruning')
    Removes branch and all descendants.
    >> branch:Branch object
    >> remtext:str = text description for reason of sequence removal
Tree._purgeOrphans(self)
    See rje_tree_group._purgeOrphans().
Tree._regenerateSeqList(self,seqlist=None,nodelist=[],id=False)
    Reorders/resizes SeqList to match given nodelist.
    >> seqlist:rje_seq.SeqList Object
    >> nodelist:list of Node Objects, which may or may not have Sequence objects associated with them.
    >> id:boolean [False] = Whether to append Node ID to front of name (for AncSeqOut)
Tree._renumberNodes(self)
    Renumbers nodes from root.
Tree._reorderNodes(self)
    Renumbers nodes from root.
Tree._reorderSeqToGroups(self)
    See rje_tree_group._reorderSeqToGroups().
Tree._resetGroups(self)
    See rje_tree_group._resetGroups().
Tree._reviewGroups(self,interactive=1)
    See rje_tree_group._reviewGroups().
Tree._saveGroups(self,filename='rje_tree.grp',groupnames=True)
    See rje_tree_group._saveGroups().
Tree._setAttributes(self)
    Sets Attributes of Object.
Tree._sumGroups(self)
    See rje_tree_group._sumGroups().
Tree._vertOrder(self,fromnode=None,compress=False,internal=True,namelist=False)
    Returns vertical (tree) ordering of nodes.
    >> fromnode:Node Object = root of tree (Actual root if None)
    >> compress:bool [False] = whether to compress 'compressed' nodes
    >> internal:bool [True] = whether to return internal nodes (or just leaves)
    >> namelist:bool [False] = whether to return list of names rather than node objects
Tree.ancSeqOut(self,file=None,type='fas',ordered=False)
    Output of ancestral sequences.
    Makes a new SeqList object and outputs it.
    >> file:str [None] = filename
    >> type:str ['fas'] = type of file
    >> ordered:boolean [False] = whether to order nodes after seqs. (Default intersperses.)
Tree.branchClades(self,branch,internal=False)
    Returns lists of sequences either side of branch, (anc,desc)
    >> branch:Branch
    >> internal:bool [False] = whether to return internal nodes as well as termini
    << tuple:lists of Node Objects
Tree.branchNum(self)


Tree.branchPam(self,pam=None)
    Calculates PAM distances for branches.
    >> pam:rje_pam.PamCtrl Object [None]
Tree.branchRoot(self,nextb=0,inroot=None)
    Manually places root on branch. Option to save outgroup.
Tree.branches(self)


Tree.buildTree(self,nsftree,seqlist=None,type='nsf',postprocess=True)
    Builds tree from NSF string.
    >> nsftree:str = Newick Standard Format tree
    >> seqlist:SeqList object to map onto tree
    >> type:str = Type (Format) of tree
    >> postprocess:boolean = Whether to re-root tree and identify duplications [True]
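
For readers unfamiliar with Newick Standard Format, the sketch below shows one generic way to turn an NSF string into a nested structure. It is a small recursive-descent parser written for illustration and is not how buildTree is implemented. The function name parse_newick and the dictionary keys are hypothetical, and labels on internal nodes (often bootstrap values) are simply stored as names.

```python
def parse_newick(text):
    """Parse a Newick string into nested dictionaries (illustrative only)."""
    text = "".join(text.split())  # drop all whitespace before parsing
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_until(stops):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stops:
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        node = {"name": "", "length": None, "children": []}
        if peek() == "(":
            pos += 1  # consume '('
            node["children"].append(parse_node())
            while peek() == ",":
                pos += 1  # consume ','
                node["children"].append(parse_node())
            if peek() != ")":
                raise ValueError("unbalanced brackets at position %d" % pos)
            pos += 1  # consume ')'
        # Leaf name, or label of an internal node (e.g. a bootstrap value).
        node["name"] = read_until(",():;")
        if peek() == ":":
            pos += 1  # consume ':'
            node["length"] = float(read_until(",();"))
        return node

    return parse_node()


tree = parse_newick("((A:0.1, B:0.2)95:0.3,\n C:0.4);")
print([child["name"] for child in tree["children"]])
```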
Tree.clearSeq(self)
    Clears current sequence information.
Tree.debugTree(self)


Tree.dictTree(self,seqname='short',nodename='long',blen='Length',fromnode=None,compress=True,title=None,qry='qry')
    Returns tree as data dictionary for SVG (and R) output.
    >> filename:str [None]
    >> seqname:str ['short'] = name to use for sequences
    - 'num' = numbers only; 'short' = short sequence names; 'long' = long names
    >> nodename:str ['long'] = how to label nodes
    - 'none' = no label, 'num' = numbers only, 'short' = short names, 'long' = long names
    >> blen:str ['Length'] = branch lengths 
    - 'none' = do not use branch lengths; 'fix' = fix at _deflen;
    - 'pam' = replace with PAM distances if calculated (else none)
    - other = use node.stat[blen]
    >> fromnode:Node Object [None] = draw subtree from Node
    >> compress:boolean = whether to compress compressed nodes
Tree.editTree(self,groupmode=False,fromnode=None,reroot=True)
    Options to alter tree details and compress nodes etc.
    - Expand/Collapse nodes
    - Flip Clade
    - Zoom to Clade (show subset of tree)
    - Prune Tree (permanently delete branches and nodes)
    - Edit Branch
    - Edit Terminal Node
    - ReRoot/Unroot
    - Display Options
    - Save Text Tree
    >> groupmode:Boolean [False] = whether node compression defines groups
    >> fromnode:Node Object [None] = start by drawing tree from given Node only (Zoom)
    >> reroot:Boolean [True] = whether to allow rerooting of tree
Tree.fileRoot(self,rootfile,inroot=False)
    Places root on unrooted tree based on outgroup sequences.
    >> rootfile:str = filename
    >> inroot:boolean = whether tree already rooted (for Exception return)
    << returns True if rooting changed, False if not
Tree.findDuplications(self,species=None,duptext='')
    Finds Duplication nodes in tree.
    >> species:str = mark duplication for this species only [None]
Tree.fullDetails(self,nodes=True,branches=True)
    Displays details of SeqList and all Sequences.
Tree.groupNum(self)


Tree.htmlTree(self,filename,seqname='short',nodename='long',blen='Length',fromnode=None,compress=True,title=None,qry='qry',treesplit=0.5,font=12,maxfont=20,width=1600,height=0,xoffset=0,yoffset=0)
    Generates a data dictionary similar to that output to a file to be read in and processed by R for SVG generation.
    See rje_tree.dictTree() and rje_svg.svgTree() for parameters.
Tree.indelTree(self,filename='indel.txt')
    Produces a text tree with indels marked.
    >> filename:str = output file name
Tree.loadTree(self,file='tree',type='nsf',boot=0,seqlist=None,postprocess=True)
    Calls appropriate method to load a tree from a certain format of file.
    >> file:str ['tree'] = filename
    >> type:str ['nsf'] = Type of file (e.g. nsf)
    - 'nsf' = Newick Standard Format
    - 'phb' = Newick Standard Format from ClustalW
    - 'sim' = build from *.sim.anc.fas file of GASP test simulation data
    >> boot:int [0] = number of bootstraps, if important (0 = ignore bootstraps)
    >> seqlist:SeqList object to map onto tree
    >> postprocess:boolean = Whether to re-root tree and identify duplications [True]
Tree.makeRootFile(self)
    Makes outgroup list and places in file. Gives option to root from it.
    << False if no rooting, True if rooted.
Tree.makeTree(self,make_seq=None,keepfile=True)
    Uses attributes to call program and make tree.
    >> make_seq:SeqList object from which to make tree
    >> keepfile:bool = whether to keep the tree file produced by makeTree or delete
Tree.makeTreeMenu(self,interactiveformenu=0,force=False,make_seq=None)
    Menu for making tree.
    >> interactiveformenu:int = Interactive level at which to give menu options [0]
    >> force:boolean = Whether to force making of tree (not allow exit without) [False]
    >> make_seq:SeqList object from which to make tree
Tree.manRoot(self)
    Gives manual choices for rooting.
    << False if unchanged, True if changed.
Tree.mapAncSeq(self,filename=None)
    Maps Ancestral Sequences to tree.
    >> filename:str = name of ancestral sequence file. Should be in GASP output format.
Tree.mapSeq(self,seqlist=None,clearseq=True)
    Maps SeqList object onto Tree.
    >> seqlist:rje_seq.SeqList Object.
    >> clearseq:boolean = whether to clear any current node sequence before matches
        - if matching from multiple SeqLists then set to False for 2nd and subsequent matches
        *** It is best to combine into a single SeqList object, which can be linked to the tree object. ***

    Names should either include tree sequence names (at start of name) or be ordered according to tree sequences (1-N).
    Will take sequences from a larger seqlist than the tree and try to find each node, first with an exact match and then partial.
Tree.midRoot(self,fixlen=0.1)
    Midpoint roots tree.
    Returns True if successful and False if fails.
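
Midpoint rooting in general places the root halfway along the longest leaf-to-leaf path. The sketch below locates that point on a tree given as (node, node, length) tuples. It illustrates the general technique under stated assumptions and is not the midRoot implementation: the function name midpoint is hypothetical, all branch lengths are assumed to be non-negative, and options such as rootbuffer and fixlen are ignored.

```python
from collections import defaultdict


def midpoint(branches):
    """Return (node_a, node_b, distance_from_a) locating the midpoint of
    the longest path in a tree given as (node, node, length) tuples."""
    adj = defaultdict(list)
    for a, b, length in branches:
        adj[a].append((b, length))
        adj[b].append((a, length))

    def farthest(start):
        # Walk the whole tree from start, keeping the longest path found.
        best = (0.0, [start])
        stack = [(start, None, 0.0, [start])]
        while stack:
            node, parent, dist, path = stack.pop()
            if dist > best[0]:
                best = (dist, path)
            for nxt, length in adj[node]:
                if nxt != parent:
                    stack.append((nxt, node, dist + length, path + [nxt]))
        return best

    # Two sweeps find the longest path (diameter) of a tree.
    _, path = farthest(branches[0][0])
    total, path = farthest(path[-1])
    half, walked = total / 2.0, 0.0
    lengths = {frozenset((a, b)): length for a, b, length in branches}
    for a, b in zip(path, path[1:]):
        length = lengths[frozenset((a, b))]
        if walked + length >= half:
            return a, b, half - walked  # root lies on branch a-b
        walked += length
    return None


tree = [("A", "X", 1.0), ("B", "X", 2.0), ("X", "Y", 1.0),
        ("C", "Y", 4.0), ("D", "Y", 0.5)]
print(midpoint(tree))
```

In the example the longest path runs between B and C (length 7.0), so the midpoint falls on the branch leading to C.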
Tree.nodeList(self,nodelist,id=False,space=True)
    Returns a text summary of listed nodes '(X,Y,Z)' etc.
    >> nodelist:list of Node Objects
    >> id:Boolean [False] = whether to use node IDs (True) or names (False)
    >> space:Boolean [True] = whether to have a space after commas for clarity
    << sumtxt:str shortnames of listed nodes (X,Y,Z...)
Tree.nodeNum(self)


Tree.nodes(self)


Tree.outGroupNode(self,node)
    Returns 'outgroup' to node = other descendant of ancestral node.
    >> node:Node Object
Tree.pathLen(self,path,stat='Length')
    Returns length of path as defined by given stat.
    >> path:list of Branch objects
    >> stat:key for branch.stat dictionary
    << dis:float = total length as float
Tree.pathLink(self,node1,node2)
    Returns list of branches linking nodes.
    >> node1:Node Object
    >> node2:Node Object
    >> retry:boolean  = whether to retry if fails
    << list of branches in order node1 -> node2.
Tree.pathSummary(self,pathin)
    Returns summary of path as Node numbers X -> Y etc.
    >> pathin:list of branch objects
    << summary:str = text summary of path
Tree.phylipTree(self,make_id,make_seq,make_dis)
    >> make_id:random string for temp phylip directory
    >> make_seq:SeqList object from which to make tree
    >> make_dis:Distance Matrix file to build tree from
Tree.placeRoot(self,method,root,fraction=None)
    Places root on unrooted tree.
    >> method:str = Method of rooting
    >> root:Branch = branch on which to place root
    >> fraction:float = fraction along branch to place root anc -> desc
Tree.pngTree(self,filename,seqname='short',nodename='long',blen='Length',fromnode=None,compress=True,title=None,type='png')
    Outputs details of tree to a file to be read in and processed by R.
    >> filename:str [None]
    >> seqname:str ['short'] = name to use for sequences
    - 'num' = numbers only; 'short' = short sequence names; 'long' = long names
    >> nodename:str ['long'] = how to label nodes
    - 'none' = no label, 'num' = numbers only, 'short' = short names, 'long' = long names
    >> blen:str ['Length'] = branch lengths 
    - 'none' = do not use branch lengths; 'fix' = fix at _deflen;
    - 'pam' = replace with PAM distances if calculated (else none)
    - other = use node.stat[blen]
    >> fromnode:Node Object [None] = draw subtree from Node
    >> compress:boolean = whether to compress compressed nodes
Tree.rTree(self,filename,seqname='short',nodename='long',blen='Length',fromnode=None,compress=True,title=None,qry='qry')
    Outputs details of tree to a file to be read in and processed by R.
    >> filename:str [None]
    >> seqname:str ['short'] = name to use for sequences
    - 'num' = numbers only; 'short' = short sequence names; 'long' = long names
    >> nodename:str ['long'] = how to label nodes
    - 'none' = no label, 'num' = numbers only, 'short' = short names, 'long' = long names
    >> blen:str ['Length'] = branch lengths 
    - 'none' = do not use branch lengths; 'fix' = fix at _deflen;
    - 'pam' = replace with PAM distances if calculated (else none)
    - other = use node.stat[blen]
    >> fromnode:Node Object [None] = draw subtree from Node
    >> compress:boolean = whether to compress compressed nodes
Tree.randomRoot(self)
    Places root on random branch, ignoring branch lengths.
Tree.randomWtRoot(self)
    Places root on random branch, weighted by branch lengths.
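
Choosing a branch with probability proportional to its length, as in the ranwt rooting option, can be sketched with the standard library. This is an illustration, not the randomWtRoot code: the function name pick_weighted_branch is hypothetical, and treating missing lengths (-1) as zero weight is an assumption.

```python
import random


def pick_weighted_branch(branches, rng=random):
    """Pick one branch name from (name, length) pairs, weighted by length."""
    names = [name for name, _ in branches]
    # Negative lengths (e.g. -1 for 'none') are given zero weight here.
    weights = [max(float(length), 0.0) for _, length in branches]
    if sum(weights) <= 0:
        return rng.choice(names)  # no usable lengths: fall back to unweighted
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(1)  # seeded generator for a repeatable demonstration
draws = [pick_weighted_branch([("short", 1.0), ("long", 9.0)], rng)
         for _ in range(1000)]
print(draws.count("long"), draws.count("short"))
```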
Tree.reformatIntegerBranchLengths(self,nsftree)
    Reformats any integer branch lengths to floats.
    >> nsftree:str = NSF tree
    << newtree:str = Returned tree with reformatted branchlengths
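
Converting integer branch lengths to floats can be done with a single regular expression, as in the sketch below. It is one possible approach rather than the module's own: the function name integer_lengths_to_floats is hypothetical, and a branch length is assumed to be any run of digits that follows a colon and ends at a comma, closing bracket, semicolon or the end of the string.

```python
import re


def integer_lengths_to_floats(nsftree):
    """Rewrite ':5' style branch lengths as ':5.0', leaving decimals untouched."""
    # The lookahead ensures only whole integers match (':2.5' is skipped).
    return re.sub(r":(\d+)(?=[,);]|$)", r":\1.0", nsftree)


print(integer_lengths_to_floats("((A:10,B:4)95:2,C:0.5);"))
```

Bootstrap labels such as the 95 above are not preceded by a colon, so they are left unchanged.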
Tree.rootPath(self,node)
    Returns path to root or 'trichotomy'.
    << list of branch objects
Tree.saveTree(self,filename='None',type='nsf',seqnum=False,seqname='short',maxnamelen=123,blen='Length',bootstraps='boot',multiline=True)
    Saves tree to a file.
    >> filename:str ['None']
    >> type:str ['nsf']
    - 'nsf' = Newick Standard Format; 'text' = text; 'r' = for r graphic, 'png' = r png, 'te' = TreeExplorer NSF
    >> seqnum:boolean [False] = whether to print sequence numbers (from zero if bootstraps = 'num')
    >> seqname:str ['short'] = name to use for sequences
    - 'num' = numbers only; 'short' = short sequence names; 'long' = long names; 
    - whitespaces are replaced with underscores, brackets with '' and colons/semicolons with -
    >> maxnamelen:int [123] = truncate names to this length for compatibility with other programs.
    >> blen:str ['Length'] = branch lengths 
    - 'none' = do not use branch lengths; 'fix' = fix at _deflen;
    - 'pam' = replace with PAM distances if calculated (else none)
    - other = use node.stat[blen]
    >> bootstraps:str ['boot'] = what to print for bootstraps
    - 'none' = nothing; 'boot' = bootstraps if given (else none); 'node' = node numbers
    >> multiline:boolean = whether to spread file over multiple lines (1 per node)
Tree.saveTrees(self,seqname='long',blen='Length',bootstraps='boot')
    Generates all tree file formats selected in normal format.
    >> seqname:str ['long'] = name to use for sequences
    - 'num' = numbers only; 'short' = short sequence names; 'long' = long names; 
    - whitespaces are replaced with underscores, brackets with '' and colons/semicolons with -
    >> blen:str ['Length'] = branch lengths 
    - 'none' = do not use branch lengths; 'fix' = fix at _deflen;
    - 'pam' = replace with PAM distances if calculated (else none)
    - other = use node.stat[blen]
    >> bootstraps:str ['boot'] = what to print for bootstraps
    - 'none' = nothing; 'boot' = bootstraps if given (else none); 'node' = node numbers
Tree.seqNum(self)


Tree.subfams(self)


Tree.sumTree(self)
    Prints Summary of Tree.
Tree.svgTree(self,filename,seqname='short',nodename='long',blen='Length',fromnode=None,compress=True,title=None,qry='qry',save=True,treesplit=0.5,font=12,maxfont=20,width=1600,height=0,xoffset=0,yoffset=0)
    Generates a data dictionary similar to that output to a file to be read in and processed by R for SVG generation.
    See rje_tree.dictTree() and rje_svg.svgTree() for parameters.
Tree.textTree(self,filename=None,seqnum=True,seqname='short',maxnamelen=50,nodename='num',showboot=True,showlen='branch',blen='Length',scale=-1,spacer=1,pause=50,fromnode=None,compress=True)
    Outputs tree as ASCII text.
    >> filename:str [None]
    >> seqnum:boolean [True] = whether to print sequence numbers (from zero if nodename <> 'none')
    >> seqname:str ['short'] = name to use for sequences
    - 'num' = numbers only; 'short' = short sequence names; 'long' = long names
    >> maxnamelen:int [50] = truncate names to this length 
    >> nodename:str ['num'] = how to label nodes
    - 'none' = no label, 'num' = numbers only, 'short' = short names, 'long' = long names
    >> showboot:boolean [True] = whether to show bootstraps if given
    >> showlen:str ['branch'] = whether to show lengths of branches leading to nodes
    - 'none' = no lengths, 'branch' = branch lengths, 'sum' = summed length from root
    >> blen:str ['Length'] = branch lengths 
    - 'none' = do not use branch lengths; 'fix' = fix at _deflen;
    - 'pam' = replace with PAM distances if calculated (else none)
    - other = use node.stat[blen]
    >> scale:int [-1] = no. of characters per self._deflen distance
    >> spacer:int [1] = 'blank' rows between nodes
    >> pause:int [50] = Number of lines printed before pausing for user <ENTER>
    >> fromnode:Node Object [None] = draw subtree from Node
    >> compress:boolean = whether to compress compressed nodes
Tree.treeFromNSF(self,nsf_file)
    Reads from Newick Standard Format file and returns NSF tree.
    * NOTE: whitespace is removed - do not have whitespace in sequence names
    >> nsf_file:str = filename
    << nsftree:str
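The whitespace note above can be illustrated with a minimal standalone sketch. It is not the rje_tree code: the function names strip_newick_whitespace and read_newick are hypothetical, and the sketch assumes the whole file holds a single tree.

```python
def strip_newick_whitespace(text):
    """Remove every whitespace character (spaces, tabs, newlines) from a Newick string."""
    return ''.join(text.split())


def read_newick(path):
    """Read a Newick file and return its contents as one whitespace-free string."""
    with open(path) as handle:
        return strip_newick_whitespace(handle.read())


# A name containing a space is silently fused, which is why such names should be avoided.
tree_text = strip_newick_whitespace("(seq A:0.1,\n seq_B:0.2);\n")
```

Here "seq A" becomes "seqA", so it would no longer match a sequence called "seq A" in a sequence file.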
Tree.treeGroup(self,callmenu=False)
    See rje_tree_group.treeGroup().
Tree.treeRoot(self)
    Tree Rooting loop.
Tree.unRoot(self)
    'Unroots' the tree => Changes node number and branches.
Tree.upgmaIDTree(self,seqlist)
    Makes UPGMA tree based on MSA %ID.
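As a rough illustration of UPGMA clustering on percentage identity, the sketch below builds a topology-only Newick string from aligned sequences. It is a generic textbook UPGMA, not the upgmaIDTree implementation. The function names, the use of 100 - %ID as the distance and the choice to skip gapped columns are all assumptions.

```python
from itertools import combinations


def percent_id(seq_a, seq_b):
    """Percent identity over aligned columns where neither sequence has a gap (assumed rule)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != '-' and b != '-']
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def upgma_newick(aligned):
    """Cluster {name: aligned sequence} by UPGMA and return a Newick topology string."""
    # Each cluster is keyed by its Newick label and stores its number of leaves.
    size = {name: 1 for name in aligned}
    dist = {frozenset(pair): 100.0 - percent_id(aligned[pair[0]], aligned[pair[1]])
            for pair in combinations(sorted(aligned), 2)}
    while len(size) > 1:
        # Closest pair of clusters; sorting the labels makes ties deterministic.
        closest = min(dist, key=lambda k: (dist[k], sorted(k)))
        a, b = sorted(closest)
        merged = '(%s,%s)' % (a, b)
        for other in size:
            if other in closest:
                continue
            # UPGMA: distance to the new cluster is the size-weighted average.
            dist[frozenset((merged, other))] = (
                dist[frozenset((a, other))] * size[a]
                + dist[frozenset((b, other))] * size[b]) / (size[a] + size[b])
        # Drop every distance that involves either of the merged clusters.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return next(iter(size)) + ';'
```

Branch lengths are left out to keep the sketch short.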

rje_tree Module Methods

rje_tree.runMain()


rje_tree.treeMenu(out,mainlog,cmd_list,tree=None)
    Menu for handling Tree Functions in 'standalone' running.
    - load sequences
    - load tree
    - save tree
    - edit tree
    - define subfams
rje_tree.treeName(name,nospace=False)
    Reformats name to be OK in NSF file.
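The renaming rule given for Tree.saveTrees() (whitespace to underscores, brackets removed, colons/semicolons to '-') can be sketched as follows. This is an illustration, not the treeName() source. The function name newick_safe_name is hypothetical, and the sketch assumes "brackets" means round and square brackets.

```python
import re


def newick_safe_name(name):
    """Return a copy of name with characters that clash with Newick syntax replaced."""
    name = re.sub(r'[()\[\]]', '', name)   # brackets removed (assumed: round and square)
    name = re.sub(r'[:;]', '-', name)      # colons and semicolons become hyphens
    return re.sub(r'\s+', '_', name)       # each run of whitespace becomes one underscore


safe = newick_safe_name('Homo sapiens (human): p53; isoform 2')
```

Each substitution targets a character with a structural role in Newick: brackets delimit clades and comments, colons introduce branch lengths and a semicolon ends the tree.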

rje_tree Module ToDo Wishlist

    # [y] Interactive menus for module and major activities
    # [y] read in from files*
    # [y] re-root/unroot as desired* (midpoint/random/outgroup/manual)
    # [y] save to files
    # [y] output text tree
    # [y] prune (remove/compress leaves/clades)
    # [ ] Add option for combining sequences (into a consensus?) - manipulates treeseqs object?
    # [ ] .. Also replace clade with ancestral sequence (GASP)
    # [y] identify duplications
    # [y] establish subfamilies/groups
    # [y] .. group=X commandline option
    # [ ] .. Modify Load Groups to allow both pure and fuzzy mapping (allows for some homoplasy?)
    # [Y] make trees from sequences
    # [Y] - CW NJ tree (as HAQESAC)
    # [ ] - Add different formats of distance matrix for disin=FILE
    # [ ] - distance matrix methods in SeqList class
    # [ ] .. Add UPGMA generation from aligned sequences?
    # [y] Tidy and __doc__ all methods
    # [y] Load ancestral sequences
    # [Y] Add a scale commandline option for textTrees
    # [ ] Add amino acid probabilities to nodes (for anc seq reconstruction)
    # [ ] Add better Exception trapping and check use of interactive and verbose settings.
    # [ ] Add a treein=X argument - will determine format from extension
    # [ ] Improve handling of duplication calculation - store status in self.opt and change only if pruned or rooting changed
    # [ ] Upgrade AncMapSeq to match modified MapSeq
    # [ ] Give module complete facelift in line with other modules! (Use list and dict attribute dictionaries.)
    # [ ] Sort out problem of multiple dup finding
    # [ ] Add MrBayes for making trees
    # [ ] Phase out savetype in favour of treeformats

rje_tree_group [version 1.2] Contains all the Grouping Methods for rje_tree.py ~ [Top]

Module: rje_tree_group
Description: Contains all the Grouping Methods for rje_tree.py
Version: 1.2
Last Edit: 19/08/09
Imports: rje, rje_seq
Imported By: rje_tree
Copyright © 2005 Richard J. Edwards - See source code for GNU License Notice

Function:
This module is a stripped down template for methods only. This is for when a class has too many methods and becomes
untidy. In this case, methods can be moved into a methods module and 'self' replaced with the relevant object. For
this module, 'self' becomes '_tree'.

Commandline:
This module is not for standalone running and has no commandline options (including 'help'). All options are handled
by the parent module: rje_tree.py

Uses general modules: copy, re, os, string, sys
Uses RJE modules: rje, rje_seq
Other modules needed: rje_blast, rje_dismatrix, rje_pam, rje_sequence, rje_uniprot

rje_tree_group Module Methods

rje_tree_group._addGroup(_tree,node)
    Adds group based at node, removing existing descendant groups.
    >> node:Node Object
rje_tree_group._autoGroups(_tree,method='man')
    Automatically goes into grouping routines.
    >> method:str = grouping method
        - man = manual grouping (unless i<0).
        - dup = duplication (all species unless groupspec specified).
        - qry = duplication with species of Query sequence (or Sequence 1) of treeseq
        - one = all sequences in one group
        - None = no group (case sensitive)
        - FILE = load groups from file
rje_tree_group._checkGroupNames(_tree)
    Automatically names groups after given gene unless already renamed.
rje_tree_group._checkGroups(_tree)
    Checks that Group selection does not break 'rules':
    - enough groups
    - enough sequences in each group
    - sufficient bootstrap support for each group
    - no variants in group if allowvar=False
    - no orphans if orphans=False
    << True if OK, False if rule(s) broken.
rje_tree_group._clearGroups(_tree)
    Clears current group selection.
rje_tree_group._dupGroup(_tree)
    Duplication grouping.
rje_tree_group._groupChoice(_tree)
    Gives manual choices for grouping.
    << False if unchanged, True if changed
rje_tree_group._groupDisSum(_tree,seq1,seq2,text='')
    Prints MSA ID, MSA Gaps and MSA Extra Summary of seq1 vs seq2
    >> seq1:Sequence Object to be compared to...
    >> seq2:Sequence Object
    >> text:str  = vs text ['vs seq2.shortName()']
    << returns string
rje_tree_group._groupRules(_tree)
    Options to change grouping options.
rje_tree_group._grpSeqSort(_tree,seqs=[],compseq=None)
    Reorders seqs according to %ID, Gaps and Extra.
    >> seqs:list of sequences to reorder
    >> compseq:Sequence Object to compare seqlist to
    << reordered seqlist
rje_tree_group._grpVarDel(_tree,delseq=[],kept=None,fam=None)
    Deletes all variants in list.
    >> delseq:list of Sequences to delete
    >> kept:Sequence Object of kept variant
    >> fam:Node Object that defines subfam
rje_tree_group._loadGroups(_tree,filename='rje_tree.grp')
    Loads sequence names into Groups from file.
    >> filename:str = group filename
rje_tree_group._orphanCount(_tree)
    Returns number of orphan sequences.
rje_tree_group._purgeOrphans(_tree)
    Removes orphan nodes.
rje_tree_group._queryGroup(_tree)
    Returns Query subfam node or None if none.
rje_tree_group._reorderSeqToGroups(_tree)
    Reorders sequences according to groups.
rje_tree_group._resetGroups(_tree)
    Resets node compression to match current group selection.
rje_tree_group._reviewGroups(_tree,interactive=1)
    Summarise, scan for variants (same species), edit group.
rje_tree_group._saveGroups(_tree,filename='rje_tree.grp',groupnames=True)
    Saves sequence names in Groups.
    >> filename:str = group filename
    >> groupnames:boolean = whether to save groupnames
rje_tree_group._sumGroups(_tree)
    Prints summary of Groups.
rje_tree_group.treeGroup(_tree,callmenu=False)
    Master Tree Grouping loop. This method will:
    1. Attempt to automatically group the sequences according to the current grouping option.
    2. Assess the grouping returned and enter manual mode if:
        (a) _tree.stat['Interactive'] > 1
    -OR-(b) _tree.stat['Interactive'] == 0 and grouping is bad or callmenu is True

rje_uniprot [version 3.10] RJE Module to Handle Uniprot Files ~ [Top]

Module: rje_uniprot
Description: RJE Module to Handle Uniprot Files
Version: 3.10
Last Edit: 06/07/11
Imports: rje, rje_sequence, rje_zen
Imported By: happi, pingu, presto_V5, unifake, compass, rje_dbase, rje_phos, rje_yeast, slim_pickings, RankByDistribution, ned_SLiMPrints, ned_SLiMPrints_Tester, ned_rankbydistribution, rje_biogrid, rje_embl, rje_ensembl, rje_iridis, rje_motiflist, rje_seq, rje_slimhtml, slimjim, slimprints
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module contains methods for handling UniProt files, primarily in other rje modules but also with some
standalone functionality. To get the most out of the module with big UniProt files (such as the downloads from EBI),
first index the UniProt data using the rje_dbase module.

This module can be used to extract a list of UniProt entries from a larger database and/or to produce summary tables
from UniProt flat files.

In addition to methods associated with the classes of this module, there are a number of methods that are called from
the rje_dbase module (primarily) to download and process the UniProt sequence database.

Input Options:
* unipath=PATH : Path to UniProt Datafile (will look here for DB Index file made with rje_dbase)
* dbindex=FILE : Database index file [uniprot.index]
* uniprot=FILE : Name of UniProt file [None]
* extract=LIST : Extract IDs/AccNums in list. LIST can be FILE or list of IDs/AccNums X,Y,.. []
* acclist=LIST : As Extract.
* specdat=LIST : Make a UniProt DAT file of the listed species from the index (over-rules extract=LIST) []
* splicevar=T/F : Whether to search for AccNum allowing for splice variants (AccNum-X) [True]
* tmconvert=T/F : Whether to convert TOPO_DOM features, using first description word as Type [False]
* fullref=T/F : Whether to store full Reference information in UniProt Entry objects [False]

Output Options:
* datout=FILE : Name of new (reduced) UniProt DAT file of extracted sequences [None]
* tabout=FILE : Table of extracted UniProt details [None]
* linkout=FILE : Table of extracted Database links [None]
* longlink=T/F : Whether link table is to be "long" (acc,db,dbacc) or "wide" (acc, dblinks) [True]
* ftout=FILE : Outputs table of features into FILE [None]
* domtable=T/F : Makes a table of domains from uniprot file [False]
* cc2ft=T/F : Extra whole-length features added for TISSUE and LOCATION (not in datout) [False]
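
The difference between the "long" and "wide" link tables selected by longlink=T/F can be sketched as below. The input structure, the function name link_rows and the way multiple links are joined in the wide form are assumptions for illustration only, and the real linkout file layout may differ.

```python
def link_rows(links, longlink=True):
    """Turn {acc: {db: [dbacc, ...]}} into table rows in long or wide layout."""
    rows = []
    for acc in sorted(links):
        if longlink:
            # Long: one (acc, db, dbacc) row per individual cross-reference.
            for db in sorted(links[acc]):
                for dbacc in links[acc][db]:
                    rows.append((acc, db, dbacc))
        else:
            # Wide: one (acc, dblinks) row per entry; the joining format is assumed.
            dblinks = '; '.join('%s:%s' % (db, '|'.join(links[acc][db]))
                                for db in sorted(links[acc]))
            rows.append((acc, dblinks))
    return rows


example = {'P12345': {'EMBL': ['X1', 'X2'], 'PDB': ['1ABC']}}
long_rows = link_rows(example, longlink=True)
wide_rows = link_rows(example, longlink=False)
```

The long layout is easier to filter or join on a single database, and the wide layout keeps one line per UniProt entry.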

UniProt Conversion Options:
* ucft=X : Feature to add for UpperCase portions of sequence []
* lcft=X : Feature to add for LowerCase portions of sequence []
* maskft=LIST : List of Features to mask out []
* invmask=T/F : Whether to invert the masking and only retain maskft features [False]
* caseft=LIST : List of Features to make upper case with rest of sequence lower case []

General Options:
* append=T/F : Append to results files rather than overwrite [False]
* memsaver=T/F : Memsaver option to save memory usage - does not retain entries in UniProt object [False]
* cleardata=T/F : Whether to clear unprocessed Entry data (True) or (False) retain in Entry & Sequence objects [True]

UniProt Download Processing Options:
* makeindex=T/F : Generate UniProt index files [False]
* makespec=T/F : Generate species table [False]
* makefas=T/F : Generate fasta files [False]
* grepdat=T/F : Whether to use GREP in attempt to speed up processing [False]

Uses general modules: glob, os, re, string, sys, time
Uses RJE modules: rje, rje_sequence

rje_uniprot Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Initial working version for interaction_motifs.py
    # 1.1 - Minor tidying and modification
    # 2.0 - Moved functions to rje_dbase. Added option to extract using index files.
    # 2.1 - Added possibility to extract splice variants
    # 2.2 - Added output of feature table for the entries in memory (not compatible with memsaver mode)
    # 2.3 - Added ID to tabout and also added accShortName() method to extract dictionary of {acc:ID__PrimaryAcc}
    # 2.4 - Add method for converting Sequence object and dictionary into UniProt objects... and saving
    # 2.5 - Added cc2ft Extra whole-length features added for TISSUE and LOCATION [False] and ftout=FILE
    # 2.6 - Added features based on case of sequence. (Uses seq.dict['Case'])
    # 2.7 - Added masking of features - Entry.maskFT(type='EM',inverse=False)
    # 2.8 - Added making of Taxa-specific databases using a list of UniProt Species codes
    # 2.9 - Added extraction of EnsEMBL, HGNC, UniProt and EntrezGene from IPI DAT file.
    # 3.0 - Added some module-level methods for use with rje_dbase.
    # 3.1 - Added extra linking of databases from UniProt entries
    # 3.2 - Added feature masking and TM conversion.
    # 3.3 - Added DBase processing options.
    # 3.4 - Made modifications to allow extended EMBL functionality as part of rje_embl.
    # 3.5 - Added SplitOut to go with rje_embl V0.1
    # 3.6 - Added longlink=T/F  : Whether link table is to be "long" (acc,db,dbacc) or "wide" (acc, dblinks) [True]
    # 3.7 - Added cleardata=T/F : Whether to clear unprocessed Entry data or retain in Entry & Sequence objects [True]
    # 3.8 - Added extraction of NCBI Taxa ID.
    # 3.9 - Added grepdat=T/F     : Whether to use GREP in attempt to speed up processing [False]
    # 3.10- Added forking for speeding up of processing.
    # 3.11- Added storing of Reference information in UniProt entries.

UniProt Class

    UniProt Download Class. Author: Rich Edwards (2005).

    Info:str
    - Name = Name of UniProt File 
    - UniPath = Path to UniProt Datafile (will look here for DB Index file) [UniProt/]
    - DBIndex = Database index file [uniprot.index]
    - DATOut = Name of new (reduced) UniProt DAT file of extracted sequences [None]
    - TabOut = Name of table of extracted sequence details [None]
    - LinkOut = Table of extracted Database links [None]
    - FTOut = Outputs table of features into FILE [None]
    - SplitOut = If path given, will split output into individual files per entry into PATH []
    - UCFT = Feature to add for UpperCase portions of sequence []
    - LCFT = Feature to add for LowerCase portions of sequence []
    
    Opt:boolean
    - ClearData = Whether to clear unprocessed Entry data (True) or (False) retain in Entry & Sequence objects [True]
    - DomTable = Makes a table of domains from uniprot file [False]
    - FullRef = Whether to store full Reference information in UniProt Entry objects [False]
    - GrepDat = Whether to use GREP in attempt to speed up processing [False]
    - LongLink = Whether link table is to be "long" (acc,db,dbacc) or "wide" (acc, dblinks) [True]    
    - MakeIndex = Generate UniProt index files [False]
    - MakeSpec = Generate species table [False]
    - MakeFas = Generate fasta files [False]
    - SpliceVar = Whether to search for AccNum allowing for splice variants (AccNum-X) [True]

    Stat:numeric

    List:list
    - Entry = list of UniProt Entries
    - Extract = Extract AccNums/IDs in list. LIST can be FILE or list of AccNums X,Y,.. []
    - SpecDat = Make a UniProt DAT file of the listed species from the index []

    Dict:dictionary    

    Obj:RJE_Objects
UniProt._add_entry(self,_entry,acclist)
    Returns True if _entry is in acclist, or False if not.
    >> _entry:uniProtEntry object (unprocessed)
    >> acclist:list of accession numbers and/or IDs
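A possible reading of accession matching with splicevar=T is sketched below. It is not the _add_entry() source. It assumes that a requested accession of the form AccNum-X (numeric X) should also match an entry carrying the parent AccNum, and that partial matches are never accepted. The names SPLICE_RE and wanted_entry are hypothetical.

```python
import re

# Assumed splice variant form: parent accession, hyphen, isoform number.
SPLICE_RE = re.compile(r'^(\S+)-(\d+)$')


def wanted_entry(entry_ids, acclist, splicevar=True):
    """Return True if any requested accession/ID matches one of the entry's IDs exactly."""
    for acc in acclist:
        if acc in entry_ids:
            return True
        if splicevar:
            match = SPLICE_RE.match(acc)
            # P12345-2 is accepted for an entry whose accession is P12345.
            if match and match.group(1) in entry_ids:
                return True
    return False
```

Because membership is tested on whole identifiers, a request for P1234 does not pull out P12345.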
UniProt._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
UniProt._newEntryObject(self)


UniProt._processWholeFile(self,filename,logft=True,cleardata=None,reformat=False,log=True)
    Processes whole file into entries.
    >> filename:str = UniProt filename
    >> logft:boolean = whether to write number of features to log file
    >> reformat:boolean = whether to save to DatOut using cut-down method
    << returns True/False dependent on success
UniProt._readSingleEntry(self,UNIPROT,logft=True,cleardata=None,reformat=False)
    Processes a single entry from an open UniProt file.
    >> UNIPROT:FileHandle = Open UniProt file for reading *at start of entry*
    >> logft:boolean = whether to write number of features to log file
    >> cleardata:boolean = whether to clear data to save memory after processing
    >> reformat:boolean = whether to save to DatOut using cut-down method
    << returns True/False dependent on success
UniProt._setAttributes(self)
    Sets Attributes of Object.
UniProt.accDict(self,acc_list=[],cleardata=None)
    Method to extract dictionaries of {acc:UniProtEntry}.
    >> acc_list:list of accession numbers. Will use self.list['Extract'] if none given
    << dictionary of {acc:UniProtEntry}
UniProt.accNameSeq(self,acc_list=[],spec=None,justsequence=True)
    Method to extract dictionary of {acc:'ID__PrimaryAcc Desc'} & {acc:seq} using index.
    >> acc_list:list of accession numbers. Will use self.list['Extract'] if none given
    >> spec:Limit to species code
    >> justsequence:bool [True] = Whether to just return the sequence data (True) or a sequence object (False)
    << tuple of dictionaries of ({acc:ID__PrimaryAcc},{acc:uniprot sequence})
UniProt.addFromSeq(self,seq=None,sequence='',name='',data={},ft=[])
    Converts into UniProtEntry object and adds to self.
    >> seq:rje_sequence.Sequence object [None]
    >> sequence:str = alternative sequence data (will be converted to Sequence object!) ['']
    >> name:str = alternative sequence name (will be converted to Sequence object!) ['']
    >> data:dict = dictionary of UniProt data with {keys ID/AC/OS etc: [list of lines]} [{}]
    >> ft:list = list of ftdic dictionaries of features {'Type/Desc':str,'Start/End':int} [[]]
    << returns entry if successful or None if fails
UniProt.domTable(self)
    Outputs domain info into a table.
UniProt.entries(self)


UniProt.entryNum(self)


UniProt.extractSpecies(self)
    Uses index file to convert species codes into list of Accession Numbers.
UniProt.ftTable(self,outfile)
    Outputs features into a table.
    >> outfile:str = Name of output file
UniProt.linkOutput(self)
    Delimited output of UniProt database links.
UniProt.linkOutputLong(self)
    Delimited output of UniProt database links in "long" format (acc,db,dbacc).
UniProt.readUniProt(self,filename=None,clear=True,acclist=[],logft=False,use_index=True,cleardata=None,reformat=False)
    Reads UniProt download into UniProtEntry objects.
    >> filename:str = UniProt download filename [None]
    >> clear:boolean = Whether to clear self.list['Entry'] before reading [True]
    >> acclist:list of str objects = UniProt accnum or id list to read
    >> logft:boolean [False] = whether to write number of features to log file
    >> use_index:boolean [True] = whether to use index file if present
    >> cleardata:boolean [True] = whether to clear processed data to save memory 
    >> reformat:boolean = whether to save to DatOut using cut-down method
    << True if success, False if fail
UniProt.run(self)
    Main Run Method if called direct from commandline. Returns True if no Errors, else False.
UniProt.saveUniProt(self,outfile,entries=[],append=False)
    Saves self as a DAT file.
    >> outfile:str = Name of output file
    >> entries:list of entries (self.list['Entry'] if none given)
    >> append:boolean = whether to append file
UniProt.tableOutput(self)
    Tabulated output of UniProt information. Divided into TabOut (UniProt summary) and LinkOut (database links)

UniProtEntry Class

    UniProt Entry Class. Author: Rich Edwards (2005).

    Info:str
    - Name = UniProt ID of Entry
    - Type = Preliminary or Standard
    - FullText = Full Text of Entry
    
    Opt:boolean
    - CC2FT = Extra whole-length features added for TISSUE and LOCATION [False]
    - ClearData = Whether to clear unprocessed Entry data (True) or (False) retain in Entry & Sequence objects [True]
    - InvMask = Whether to invert the masking and only retain maskft features [False]    
    - FullRef = Whether to store full Reference information in UniProt Entry objects [False]
    - TMConvert = Whether to convert TOPO_DOM features, using first description word as Type [False]

    Stat:numeric
    - Length = Length of Sequence as annotated

    List:list
    - CaseFT = List of Features to make upper case with rest of sequence lower case []
    - Feature = List of feature dictionaries: [Type,Start,End,Desc]
    - MaskFT = List of Features to mask out []
    - PubMed = List of PubMed IDs (as strings)
    - Keywords = List of UniProt Keywords
    - Tissues = List of UniProt Tissues
    - References = List of reference dictionaries [RP,RC,RX,RA,RT,RL,RG]
    - Synonyms = List of Gene synonyms
    
    Dict:dictionary
    - Data = Dictionary of lists of UniProt data (Keys are line headers ID/AC/CC etc.)
    - DB = Specific extractions from DR lines for use in other programs. {DB:[AccNum/ID]}
    - Comments = Dictionary of comments: {Type:List of Comments}
    - DBLinks = List of Database Link dictionaries {Dbase,List of Details} for dblinks output

    Obj:RJE_Objects
    - Sequence = rje_sequence.Sequence object
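The Data dictionary described above (line headers as keys, lists of lines as values) can be pictured with a small parser for UniProt flat-file text. This is a sketch and not the module's reader. The function name parse_dat and the 'SEQ' key used for sequence lines are invented for the example.

```python
def parse_dat(text):
    """Split UniProt flat-file text into a list of {line code: [line contents]} dictionaries."""
    entries, data = [], {}
    for line in text.splitlines():
        if line.startswith('//'):
            # '//' terminates an entry in the flat-file format.
            if data:
                entries.append(data)
            data = {}
        elif line.strip():
            # Two-letter line code, then content from column 6 onwards.
            # Sequence lines have a blank code; 'SEQ' is a made-up key for them.
            key = line[:2].strip() or 'SEQ'
            data.setdefault(key, []).append(line[5:].rstrip())
    if data:
        entries.append(data)  # tolerate a missing final terminator
    return entries


demo_text = '\n'.join([
    'ID   TEST_HUMAN   Reviewed;   10 AA.',
    'AC   P00001; Q00002;',
    'DR   EMBL; X12345; CAA12345.1; -; mRNA.',
    'SQ   SEQUENCE   10 AA;',
    '     MKTAYIAKQR',
    '//',
])
demo_entries = parse_dat(demo_text)
```

Keeping raw lines grouped by code allows later steps to parse only the line types they need.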
UniProtEntry._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
UniProtEntry._setAttributes(self)
    Sets Attributes of Object.
UniProtEntry._uniParse(self,key)
    Parses list of elements from self.dict['Data'] (and pops).
    >> key:str = Key of UniProt entry type
    << List of matched elements or False if failure.
UniProtEntry.caseFT(self,types=[])
    Converts sequence to upper case within given feature types and lower case elsewhere.
    >> types:list of str [] = types of feature to be upper case
UniProtEntry.cc2ft(self)
    Adds extra full-length features based on LOCATION and TISSUE.
UniProtEntry.isSpecies(self,spec=None,speclist=[])
    Returns True if entry corresponds to listed species (or species code).
    >> spec:str = single species that MUST be right
    >> speclist:list = can match any species in list
UniProtEntry.maskFT(self,types=['EM'],inverse=False,mask='X',log=True)
    Masks given feature types.
    >> types:list of str [['EM']] = types of feature to mask
    >> inverse:bool [False] = whether to mask all sequence *except* listed types
    >> mask:str ['X'] = character to use for masking
    >> log:bool [True] = whether to log the effects of masking
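Feature masking with the inverse option can be sketched on a plain string and a list of feature dictionaries (Type, Start, End, Desc, as listed for the Feature list). This is not the maskFT() source. The function name mask_features is hypothetical, and positions are treated as 1-based inclusive, as in UniProt feature tables.

```python
def mask_features(sequence, features, types=('EM',), inverse=False, mask='X'):
    """Return sequence with residues inside (or, if inverse, outside) the given feature types masked."""
    covered = [False] * len(sequence)
    for ft in features:
        if ft['Type'] in types:
            # Convert 1-based inclusive positions to a clipped 0-based range.
            for i in range(max(ft['Start'], 1) - 1, min(ft['End'], len(sequence))):
                covered[i] = True
    # Normal mode masks covered residues; inverse mode masks everything else.
    return ''.join(mask if hit != inverse else res
                   for res, hit in zip(sequence, covered))


fts = [{'Type': 'DOMAIN', 'Start': 3, 'End': 5, 'Desc': 'example domain'},
       {'Type': 'EM', 'Start': 8, 'End': 9, 'Desc': 'example'}]
masked = mask_features('ABCDEFGHIJ', fts, types=['DOMAIN'])
kept = mask_features('ABCDEFGHIJ', fts, types=['DOMAIN'], inverse=True)
```

With inverse=True only the listed feature types survive, which corresponds to the invmask=T/F option.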
UniProtEntry.orderFT(self)
    Orders features by start, end, type.
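Ordering by start, end and type amounts to a sort on a three-part key. A minimal sketch, assuming features are dictionaries with Type, Start and End keys (order_features is a hypothetical name):

```python
def order_features(features):
    """Return a new list of feature dictionaries sorted by Start, then End, then Type."""
    return sorted(features, key=lambda ft: (ft['Start'], ft['End'], ft['Type']))


unordered = [{'Type': 'DOMAIN', 'Start': 10, 'End': 50},
             {'Type': 'SIGNAL', 'Start': 1, 'End': 20},
             {'Type': 'CHAIN', 'Start': 10, 'End': 50}]
ordered_types = [ft['Type'] for ft in order_features(unordered)]
```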
UniProtEntry.process(self,logft=True,cleardata=None)
    Extract Details from self.dict['Data'] to Sequence object.
    >> logft:boolean = whether to write number of features to log file
    >> cleardata:boolean = whether to clear self.dict['Data'] after processing (to save memory) [True]
    << True if OK, False if not.
UniProtEntry.seqi(self,ikey)


UniProtEntry.shortName(self)


UniProtEntry.specialDB(self,dbase,details)
    Extracts specific information to self.dict['DB'].
    >> dbase:str = Database identifier extracted from DR line of DAT file - '^(\S+);\s+(\S.+)$'[0]
    >> details:str = Database links extracted from DR line of DAT file - '^(\S+);\s+(\S.+)$'[1]
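The regular expression quoted above splits the content of a DR line into the database identifier and the remaining link details. A short sketch of that split follows. The function name split_dr is hypothetical and the input is assumed to be the DR line with its "DR   " prefix already removed.

```python
import re

# Pattern as given in the specialDB() documentation.
DR_RE = re.compile(r'^(\S+);\s+(\S.+)$')


def split_dr(dr_text):
    """Return (dbase, details) for DR line content, or None if it does not match."""
    match = DR_RE.match(dr_text)
    if match is None:
        return None
    return match.group(1), match.group(2)


dr_parts = split_dr('EMBL; X12345; CAA12345.1; -; mRNA.')
```

The first group stops at the first semicolon that is followed by whitespace, so everything after the database name stays together in the second group.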
UniProtEntry.uniProtFromSeq(self,seq=None,sequence='',name='',data={},ft=[])
    Converts into UniProtEntry object (self!).
    >> seq:rje_sequence.Sequence object [None]
    >> sequence:str = alternative sequence data (will be converted to Sequence object!) ['']
    >> name:str = alternative sequence name (will be converted to Sequence object!) ['']
    >> data:dict = dictionary of UniProt data with {keys ID/AC/OS etc: [list of lines]} [{}]
    >> ft:list = list of ftdic dictionaries of features {'Type/Desc':str,'Start/End':int} [[]]
    << returns self if successful or None if fails

rje_uniprot Module Methods

rje_uniprot.downloadUniProt(callobj)
    Downloads the UniProt database using the attributes of callobj.
rje_uniprot.forkProcess(callobj,alldatfiles,makeindex,makespec,makefas,forkbytes=1e8)
    Forks out DAT download processing to speed up (hopefully!).
rje_uniprot.processUniProt(callobj,makeindex=True,makespec=True,makefas=True,temp=False)
    Processes UniProt making index file and spectable as appropriate.
rje_uniprot.runMain()




rje_uniprot Module ToDo Wishlist

    # [ ] : Lots of functionality to add! Look also to BioPython.
    # [Y] : Modify the searching for entry in acclist to cope with partial matches (exclude them)
    # [ ] : Modify DomTable to work with Memsaver
    # [ ] : Add specific database detail extraction, first for IPI DAT files and later for UniProt
    # [ ] : Add a database mapping method for extracting DB cross-refs.
    # [ ] : Change the non-forking processing method to match forked one, which is generally faster!

RJE_XGMML [version 0.0] RJE XGMML Module ~ [Top]

Module: RJE_XGMML
Description: RJE XGMML Module
Version: 0.0
Last Edit: 14/11/07
Imports: rje, rje_zen
Imported By: comparimotif_V3, pingu, qslimfinder, slimfinder, rje_ppi, rje_slimhtml
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
This module is currently designed to store data for, and then output, an XGMML file for uploading into Cytoscape etc.
Future versions may incorporate the ability to read and manipulate existing XGMML files.

Commandline:
At present, all commands are handled by the class populating the XGMML object.

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje
Other modules needed: None

rje_xgmml Module Version History

    # 0.0 - Initial Compilation.

XGMML Class

    XGMML Class. Author: Rich Edwards (2007).

    Info:str
    - Name = This is the ID used for the graph [RJE_XGMML]
    - Description = Description of network
    - Type = Type of data in network
    
    Opt:boolean

    Stat:numeric

    List:list

    Dict:dictionary
    - Edge = Dictionary of edges between nodes {Type:{(source,target):Attributes}}
    - EdgeAtt = Dictionary of edge attributes {Att:Type}
    - Node = Dictionary of Nodes to be output {Node:Attributes}
    - NodeAtt = Dictionary of node attributes {Att:Type}
    - NodePos = Dictionary of node positions {Node:[x,y]}

    Obj:RJE_Objects
XGMML._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
XGMML._setAttributes(self)
    Sets Attributes of Object.
XGMML.saveXGMML(self,filename=None,format='Cytoscape')
    Saves object data to file in XGMML format.
    >> filename:str [None] = Output file. Will use name.xgmml if None.
    >> format:str [Cytoscape] = Target for output file
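To show how Node and Edge dictionaries of the shapes listed above could map onto XGMML, here is a minimal sketch using the standard library. It is not the saveXGMML() output and omits most attributes a Cytoscape-ready file would carry. The function name xgmml_string and the edge label format are assumptions.

```python
import xml.etree.ElementTree as ET

XGMML_NS = 'http://www.cs.rpi.edu/XGMML'


def xgmml_string(name, nodes, edges):
    """Serialise nodes {node: {att: value}} and edges {type: {(source, target): {att: value}}}."""
    graph = ET.Element('graph', {'label': name, 'xmlns': XGMML_NS})
    for node in sorted(nodes):
        elem = ET.SubElement(graph, 'node', {'id': node, 'label': node})
        for att in sorted(nodes[node]):
            ET.SubElement(elem, 'att', {'name': att, 'value': str(nodes[node][att])})
    for etype in sorted(edges):
        for (source, target) in sorted(edges[etype]):
            # Edge label format is an assumption made for this sketch.
            label = '%s (%s) %s' % (source, etype, target)
            elem = ET.SubElement(graph, 'edge',
                                 {'source': source, 'target': target, 'label': label})
            atts = edges[etype][(source, target)]
            for att in sorted(atts):
                ET.SubElement(elem, 'att', {'name': att, 'value': str(atts[att])})
    return ET.tostring(graph, encoding='unicode')


demo_xml = xgmml_string('demo',
                        {'A': {'degree': 1}, 'B': {'degree': 1}},
                        {'ppi': {('A', 'B'): {'score': 0.9}}})
```

Keying edges by type and then by (source, target) tuple, as the Edge dictionary does, allows several edge types between the same pair of nodes.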

rje_xgmml Module Methods

rje_xgmml.runMain()




rje_xgmml Module ToDo Wishlist

    # [ ] : Read XGMML back into object
    # [ ] : Read and convert other formats
    # [ ] : Read distance matrices, perform MDS and output?

rje_zen [version 1.0] Random Zen Wisdom Generator ~ [Top]

Module: rje_zen
Description: Random Zen Wisdom Generator
Version: 1.0
Last Edit: 15/04/08
Imports: rje
Imported By: aphid, budapest, comparimotif_V3, fiesta, gfessa, happi, haqesac, multihaq, picsi, pingu, slimmaker, slimsearch, unifake, file_monster, prodigis, rje_dbase, rje_itunes, rje_phos, rje_pydocs, rje_seqgen, rje_seqplot, rje_ssds, rje_yeast, sfmap2png, wormpump, bob, rje_biogrid, rje_codons, rje_db, rje_dismatrix_V2, rje_disorder, rje_embl, rje_ensembl, rje_genbank, rje_genecards, rje_genemap, rje_go, rje_hmm, rje_html, rje_iridis, rje_markov, rje_mascot, rje_omim, rje_paml, rje_ppi, rje_qsub, rje_seq, rje_seqlist, rje_slim, rje_slimcalc, rje_slimhtml, rje_slimlist, rje_svg, rje_tm, rje_uniprot, rje_xgmml, rje_xml, slimfrap, slimjim, zentest
Copyright © 2007 Richard J. Edwards - See source code for GNU License Notice

Function:
Generates random (probably nonsensical) Zen wisdoms. Just for fun.

Commandline:
* wisdoms=X : Number of Zen Wisdoms to return [10]

Uses general modules: copy, glob, os, string, sys, time
Uses RJE modules: rje
Other modules needed: None

rje_zen Module Version History

    # 0.0 - Initial Compilation.
    # 1.0 - Full working version with four zen types and reasonable vocabulary

Zen Class

    Zen Wisdom Class. Author: Rich Edwards (2005).

    Info:str
    
    Opt:boolean

    Stat:numeric
    - Wisdoms = Number of Zen Wisdoms to return [10]

    List:list

    Dict:dictionary    

    Obj:RJE_Objects
Zen._adjective(self,ztype='A')
    Returns a random adjective.
Zen._adverb(self,ztype='A')
    Returns a random adverb.
Zen._cmdList(self)
    Sets attributes according to commandline parameters:
    - see .__doc__ or run with 'help' option
Zen._linker(self,ztype='A')
    Returns a random linker.
Zen._noun(self,ztype='A')
    Returns a random noun.
Zen._setAttributes(self)
    Sets Attributes of Object.
Zen._verb(self,ztype='A')
    Returns a random verb.
Zen._zenA(self)
    Generates Zen of the type "The WISE MAN X BUT THE WWW XXX Y"
Zen._zenB(self)
    Generates Zen of the type "It is better X when Y"
Zen._zenC(self)
    Generates Zen of the type "Doing X leads to Y"
Zen._zenD(self)
    One man's X is another man's Y.
Zen.run(self)
    Main Run method. Calls wisdom method X times.
Zen.wisdom(self,type=None)
    Generates and returns a random Zen wisdom
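The template approach described for the _zenA to _zenD methods can be sketched with the "One man's X is another man's Y" pattern. The vocabulary and the function names zen_d and wisdoms are invented for illustration and do not come from rje_zen.

```python
import random

# Hypothetical vocabulary; the module's own word lists are not reproduced here.
NOUNS = ['river', 'mountain', 'silence', 'teacup']


def zen_d(rng):
    """Fill the "One man's X is another man's Y" template with two different nouns."""
    x, y = rng.sample(NOUNS, 2)
    return "One man's %s is another man's %s." % (x, y)


def wisdoms(count=10, seed=None):
    """Return a list of count random wisdoms; a seed makes the output repeatable."""
    rng = random.Random(seed)
    return [zen_d(rng) for _ in range(count)]
```

The default count of 10 mirrors the wisdoms=X option.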

rje_zen Module Methods

rje_zen.runMain()




rje_zen Module ToDo Wishlist

    # [ ] : Add reading of vocabulary from files?

© RJ Edwards 2012. Last modified 16 Apr 2012.