Detailed Description
App-derived class for Horizontal chromosome transfer test
Member Function Documentation
def MGT::Proj::HctApp::HctApp::cvMatToPhylipMat(self)
Rename the sequence names in the distance matrix to Phylip-compatible names.
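A minimal sketch of such a renaming step, assuming the matrix is held in memory as a list of names plus a square list of rows. The function name, the in-memory layout, and the name_map argument are hypothetical and are not taken from HctApp.py.

```python
def write_phylip_matrix(names, dist, name_map):
    """Return a Phylip-style square distance matrix as text.

    names    -- original sequence names, one per matrix row (assumed layout)
    dist     -- square list of lists of floats
    name_map -- hypothetical dict mapping original name -> short name
    """
    lines = ["%5d" % len(names)]
    for name, row in zip(names, dist):
        short = name_map[name]
        if len(short) > 10:
            raise ValueError("Phylip name too long: %r" % short)
        # Classic Phylip expects the name padded to 10 characters.
        cells = " ".join("%.6f" % d for d in row)
        lines.append("%-10s %s" % (short, cells))
    return "\n".join(lines) + "\n"
```

The length check fails early instead of silently truncating, which would risk producing duplicate names.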
def MGT::Proj::HctApp::HctApp::delDbSql(self)
Call this to free the connection to the SQL server if it will not be needed for an extended period of time.
def MGT::Proj::HctApp::HctApp::doWork(self, kw)
Do the actual work.
Must be redefined in the derived classes.
Should not be called directly by the user except from doWork() in a derived class.
Should work with empty keyword dict, using only self.opt.
If doing batch submission of other App instances, must return a list of sink (final) BatchJob objects.
Reimplemented from MGT::App::App.
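The contract above might be honored roughly as follows. This is only a sketch: App here is a stand-in for MGT.App.App, whose real interface is not shown in this page, and Opt, EchoApp, and the mode option are hypothetical. It also assumes that kw is received as keyword arguments.

```python
class Opt(object):
    """Hypothetical options holder with attribute access."""

    def __init__(self, **kw):
        self.__dict__.update(kw)


class App(object):
    """Stand-in for MGT.App.App; the real base class is not shown here."""

    def __init__(self, opt):
        self.opt = opt

    def doWork(self, **kw):
        raise NotImplementedError


class EchoApp(App):
    """Hypothetical derived class."""

    def doWork(self, **kw):
        # Fall back to self.opt so the method works with an empty kw dict.
        mode = kw.get("mode", self.opt.mode)
        self.result = "ran in mode %s" % mode
        # No batch jobs were submitted, so there are no sink jobs to return.
        return []
```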
def MGT::Proj::HctApp::HctApp::exportTreeDynAnnot(self)
Create a file in TreeDyn annotation format.
Example of this file format (taken from the TreeDyn web site):
BUD2 Subcellular_loc { Bud_neck Cytoskeletal } Cellular_Role { Cell_polarity } FuncCat { GTPase_activating_protein }
BUD3 Subcellular_loc { Bud_neck } Cellular_Role { Cell_polarity } FuncCat { Unknown }
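Judging only from the two example lines above, each record is a leaf label followed by "variable { values }" groups. A sketch of a formatter for one such line follows; the function name and argument layout are hypothetical, and values are assumed to contain no spaces.

```python
def treedyn_annot_line(label, annot):
    """Format one TreeDyn annotation record.

    label -- leaf label as it appears in the tree
    annot -- list of (variable, [values]) pairs, kept in the given order
    """
    fields = [label]
    for var, values in annot:
        # Each variable is followed by its values inside spaced braces.
        fields.append("%s { %s }" % (var, " ".join(values)))
    return " ".join(fields)
```

Calling it with the label BUD3 and the three variables from the second example line reproduces that line exactly.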
def MGT::Proj::HctApp::HctApp::getDbSql(self)
Allocate (if necessary) and return a connection to SQL server
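The allocate-on-demand behavior described here, together with the release performed by delDbSql(), can be sketched as below. The LazyDb class is hypothetical, and sqlite3 is used only as a self-contained substitute for whatever SQL server the application actually connects to.

```python
import sqlite3


class LazyDb(object):
    """Hypothetical holder illustrating the allocate-on-demand pattern."""

    def __init__(self, path=":memory:"):
        self._path = path
        self._db = None

    def getDbSql(self):
        # Open the connection only on first use, then reuse it.
        if self._db is None:
            self._db = sqlite3.connect(self._path)
        return self._db

    def delDbSql(self):
        # Release the connection; a later getDbSql() call reopens it.
        if self._db is not None:
            self._db.close()
            self._db = None
```

Callers that always go through getDbSql() never need to know whether the connection was released in the meantime.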
def MGT::Proj::HctApp::HctApp::loadAccToNameMem(self)
Load dict(acc->name) from SQL table
def MGT::Proj::HctApp::HctApp::loadCvTreeNamesFile(self)
Load names from a name file
def MGT::Proj::HctApp::HctApp::loadGenomicIdsOrgType(self, db, idGenSeq, inserterSeq, orgType)
Load FASTA deflines for genomic sequences, index them by accession, and drop all non-NC_ records and all plasmids.
@todo It might be more robust to parse the GenBank file, e.g.
FEATURES             Location/Qualifiers
     source          1..208369
                     /organism="Bacillus cereus ATCC 10987"
                     /mol_type="genomic DNA"
                     /strain="ATCC 10987"
                     /db_xref="ATCC:10987"
                     /db_xref="taxon:222523"
                     /plasmid="pBc10987"
     gene            join(207497..208369,1..687)
We would have to fix the gap(unk100) bug first, and also check how "extrachromosomal" is labeled
in the GenBank file.
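A sketch of the defline filtering described for this method. It assumes the older RefSeq defline style ">gi|<gi>|ref|<acc.ver>| <title>" and that plasmids can be recognized by the word "plasmid" in the title. Both are assumptions, as is the function name; the actual parsing in HctApp.py may differ.

```python
import re

# Accession (without version) inside a '|ref|<acc.ver>|' field.
_ACC_RE = re.compile(r"\|ref\|([A-Z]{2}_\d+)(?:\.\d+)?\|")


def index_chromosome_deflines(deflines):
    """Return dict(accession -> defline) for NC_ records that are not plasmids."""
    index = {}
    for line in deflines:
        m = _ACC_RE.search(line)
        if m is None:
            continue  # no recognizable RefSeq accession
        acc = m.group(1)
        if not acc.startswith("NC_"):
            continue  # drop non-NC_ records
        if "plasmid" in line.lower():
            continue  # drop plasmids (title-based heuristic)
        index[acc] = line
    return index
```

The title-based plasmid test is the fragile part, which is the motivation for the @todo about parsing the GenBank source feature instead.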
def MGT::Proj::HctApp::HctApp::loadProtIdsOrgType(self, db, idGenSeq, inserterSeq, orgType)
Load FASTA deflines for the protein sequences generated by pullSeq(), parse them, and insert them into an SQL table.
def MGT::Proj::HctApp::HctApp::loadSelProtIdsMem(self)
Return the table created by subSampleProtIds() as a NumPy array.
def MGT::Proj::HctApp::HctApp::makeCvTreeSeqInput(self, accProts)
Create input FASTA files for CVTree by selecting only sequences with Acc from accProts.
Warning: this will erase existing content of the target directory.
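The selection step can be sketched as a filter over FASTA lines. It assumes the defline format "><protein_acc>|<taxid>_<chromosome_acc>" that this class uses elsewhere; the function name is hypothetical, and the directory handling (including erasing the target directory) is deliberately left out.

```python
def filter_fasta_by_acc(lines, acc_prots):
    """Yield FASTA lines whose record accession is in acc_prots.

    lines     -- iterable of FASTA lines (deflines and sequence lines)
    acc_prots -- set of protein accessions to keep
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The protein accession is the part before the first '|'.
            acc = line[1:].split("|", 1)[0].strip()
            keep = acc in acc_prots
        if keep:
            yield line
```

Passing acc_prots as a set keeps the membership test constant-time even for large selections.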
def MGT::Proj::HctApp::HctApp::makeSeqGenNames(self)
For each genetic element, we generate an artificial mnemonic name.
The purpose of generating a new name is two-fold:
1. Phylip cannot handle names longer than 10 characters.
2. For each organism with multiple genetic elements, our name will reflect the
nature of that element (chromosome, plasmid, ...) and number the elements of
the same nature in order of decreasing size.
Examples:
If the organism is assigned id 123 and it has only one chromosome,
that chromosome will be called t0123.
If that organism has two chromosomes, they will be called t0123C1 and t0123C2,
with C1 being the larger one.
The disparity between the one- and multi-chromosome naming schemes is designed to make
multi-chromosomal organisms visibly stand out.
For plasmids, names will be like t0123P1.
For all other genetic elements, names will be like t0123?1, where ? is the first letter of
the genel field of the seq_gen table.
The names are saved into the SQL table name_gen.
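The naming rules above can be sketched in a few lines. Several details are assumptions read off the examples rather than confirmed by the source: the organism id is zero-padded to four digits, the type letter is upper-cased, and the genel values are spelled "chromosome" and "plasmid". The function name and the tuple layout are hypothetical, and the SQL storage is omitted.

```python
def make_gen_names(org_id, elements):
    """Return dict(accession -> mnemonic name) for one organism.

    org_id   -- integer organism id
    elements -- list of (accession, genel, length) tuples
    """
    base = "t%04d" % org_id
    by_type = {}
    for acc, genel, length in elements:
        by_type.setdefault(genel, []).append((length, acc))
    names = {}
    for genel, items in by_type.items():
        # Largest element first, so suffix 1 goes to the longest one.
        items.sort(key=lambda it: it[0], reverse=True)
        if genel == "chromosome" and len(items) == 1:
            # A single chromosome gets the bare organism name.
            names[items[0][1]] = base
            continue
        letter = genel[0].upper()
        for rank, (length, acc) in enumerate(items, 1):
            names[acc] = "%s%s%d" % (base, letter, rank)
    return names
```

With a four-digit id and a single-digit rank the names stay well within Phylip's 10-character limit; larger ids or ranks would need a check.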
def MGT::Proj::HctApp::HctApp::parseCmdLinePost(klass, options, args, parser)
Optionally modify options and args in-place.
Called at the end of parseCmdLine to allow derived classes to customize the option processing.
@param options options returned by OptionParser and converted to a Struct object
@param args args returned by OptionParser
@param parser OptionParser object used to parse the command line; needed here to call its error() method
if necessary.
options should be modified in place by this method
Reimplemented from MGT::App::App.
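A sketch of how such a hook might validate and normalize options in place. DemoApp and the two options are hypothetical; only the hook signature follows the description above. The raw optparse values object is used here instead of the Struct conversion, which is not shown in this page.

```python
from optparse import OptionParser


class DemoApp(object):
    """Hypothetical class; only the hook signature follows the text above."""

    @classmethod
    def parseCmdLinePost(klass, options, args, parser):
        # Report bad input through the parser, which prints usage and exits.
        if options.minLen < 0:
            parser.error("--min-len must be non-negative")
        # Normalize in place; nothing is returned.
        options.mode = options.mode.lower()


parser = OptionParser()
parser.add_option("--mode", default="Build")
parser.add_option("--min-len", dest="minLen", type="int", default=0)
options, args = parser.parse_args(["--mode", "Select", "--min-len", "5"])
DemoApp.parseCmdLinePost(options, args, parser)
```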
def MGT::Proj::HctApp::HctApp::pullGenBankProts(self, orgType, idsGen)
Scan a set of GenBank protein files and output a multi-FASTA file with each protein as a separate record and the chromosome encoded in the defline.
The defline will look like:
>YP_089573|221988_NC_006300.1
as in ><protein_acc>|<taxid>_<chromosome_acc>
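The defline layout shown above can be built and parsed with two small helpers. The function names are hypothetical; the only facts used are the field order and the separators from the example.

```python
def make_defline(prot_acc, taxid, chrom_acc):
    """Build a defline such as '>YP_089573|221988_NC_006300.1'."""
    return ">%s|%s_%s" % (prot_acc, taxid, chrom_acc)


def parse_defline(line):
    """Split a defline into (protein_acc, taxid, chromosome_acc)."""
    prot_acc, rest = line.lstrip(">").strip().split("|", 1)
    # Split at the first underscore only: the chromosome accession
    # (e.g. NC_006300.1) contains an underscore of its own.
    taxid, chrom_acc = rest.split("_", 1)
    return prot_acc, int(taxid), chrom_acc
```

Splitting on "|" first matters because the protein accession also contains an underscore.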
def MGT::Proj::HctApp::HctApp::pullGenBankProtsCatOld(self)
Scan a set of GenBank protein files and output a multi-FASTA file with all proteins in one chromosome concatenated.
@todo We cannot insert spacers between proteins, because CvTree would not treat them as separators, so we get chimeric k-mers that span protein boundaries.
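The chimeric k-mer problem is easy to see on a toy input. The sketch below is an illustration only and does not reflect how CvTree counts k-mers internally. It lists the k-mers that exist in the concatenated string but in none of the individual proteins.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def chimeric_kmers(proteins, k):
    """Return k-mers present in the concatenation but in no single protein."""
    own = set()
    for p in proteins:
        own |= kmers(p, k)
    return kmers("".join(proteins), k) - own
```

For the two toy peptides MKVA and LTRG with k = 3, the result is the pair VAL and ALT, both of which straddle the junction.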
def MGT::Proj::HctApp::HctApp::pullOutGroupProts(self)
Convert a RefSeq per-chromosome .faa file into a .faa file with our defline format.
The defline will look like:
>YP_089573|221988_NC_006300.1
as in ><protein_acc>|<taxid>_<chromosome_acc>
def MGT::Proj::HctApp::HctApp::pullSeq(self)
Convert original RefSeq Fasta and GenBank files into SQL tables and protein Fasta files
def MGT::Proj::HctApp::HctApp::runCvTree(self)
Run the cvtree executable that generates the composition vectors.
Warning: this clears the existing content of cvVecDir.
def MGT::Proj::HctApp::HctApp::runCvTreeMat(self)
Run the cvtree script that builds the distance matrix from precomputed composition vectors.
def MGT::Proj::HctApp::HctApp::selectAnyGenElement(self, minProtSeqLen)
Select genomic elements of any type that are longer than the cutoff value for the distance matrix calculation.
def MGT::Proj::HctApp::HctApp::selectChromosomes(self, minProtSeqLen)
Select a subset of chromosomes for the distance matrix calculation.
def MGT::Proj::HctApp::HctApp::sqlPostPullSeq(self)
Create various derived tables after pulling sequences
def MGT::Proj::HctApp::HctApp::subSampleProtIds(self, maxProtSeqLen)
Select a subset of proteins from each genomic sequence, constrained by the sum of protein lengths.
The use case: we want each chromosome to be represented by a sample of equal total length, to make
sure that the phylogenetic tree we build is not influenced by length differences.
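One way such a length-constrained subsample could be drawn is sketched below. The source does not say how proteins are chosen, so the random order, the greedy fill, the function name, and the seed argument are all assumptions made for illustration.

```python
import random


def subsample_by_total_length(prots, max_total_len, seed=0):
    """Pick proteins in random order until the length budget is reached.

    prots         -- list of (protein_id, length) for one genomic sequence
    max_total_len -- upper bound on the summed length of picked proteins
    """
    rng = random.Random(seed)
    order = list(prots)
    rng.shuffle(order)
    picked, total = [], 0
    for pid, length in order:
        if total + length > max_total_len:
            continue  # skip proteins that would exceed the budget
        picked.append(pid)
        total += length
    return picked, total
```

Applying the same budget to every genomic sequence gives samples of comparable total length, which is the stated goal.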
The documentation for this class was generated from the following file:
- mgtaxa/MGT/Proj/HctApp.py