Public Member Functions
def	parseCmdLinePost
def	getDbSql
def	delDbSql
def	doWork
def	pullSeq
def	loadGenomicIdsOrgType
def	pullGenBankProts
def	pullOutGroupProts
def	pullGenBankProtsCatOld
def	loadProtIdsOrgType
def	sqlPostPullSeq
def	makeSeqGenNames
def	selectChromosomes
def	selectAnyGenElement
def	runCvTree
def	runCvTreeMat
def	subSampleProtIds
def	loadSelProtIdsMem
def	makeCvTreeSeqInput
def	loadCvTreeNamesFile
def	cvMatToPhylipMat
def	loadAccToNameMem
def	exportTreeDynAnnot

Detailed Description

App-derived class for the horizontal chromosome transfer test.

Member Function Documentation

def MGT::Proj::HctApp::HctApp::cvMatToPhylipMat ( self )

Rename the sequence names in the distance matrix to Phylip-compatible names.
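As a rough illustration of what such a renaming step involves, the sketch below writes a square distance matrix in Phylip layout (taxon count on the first line, then one row per taxon with the name padded to 10 characters). The function name `to_phylip_matrix`, the in-memory inputs, and the example names are assumptions made for illustration; the actual method works on files produced by CVTree and may differ.

```python
def to_phylip_matrix(names, dist, name_map):
    """Render a square distance matrix as Phylip text.

    names    -- original sequence names, one per matrix row
    dist     -- list of rows of floats (square matrix)
    name_map -- dict mapping each original name to a short name
    Hypothetical helper, not part of HctApp.
    """
    lines = ["%5d" % len(names)]
    for name, row in zip(names, dist):
        short = name_map[name]
        if len(short) > 10:
            raise ValueError("Phylip names are limited to 10 characters: %r" % short)
        # Pad the name to exactly 10 columns, then list the distances.
        lines.append(short.ljust(10) + " " + " ".join("%.6f" % d for d in row))
    return "\n".join(lines) + "\n"


# Made-up example names and distances.
example_map = {"seqA_long_name": "t0001", "seqB_long_name": "t0002C1"}
text = to_phylip_matrix(
    ["seqA_long_name", "seqB_long_name"],
    [[0.0, 0.25], [0.25, 0.0]],
    example_map,
)
print(text)
```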

def MGT::Proj::HctApp::HctApp::delDbSql ( self )

Call this to free the connection to the SQL server if it will not be needed for an extended period of time.

def MGT::Proj::HctApp::HctApp::doWork	(	self,
		kw
	)

Do the actual work.
Must be redefined in the derived classes.
Should not be called directly by the user except from doWork() in a derived class.
Should work with empty keyword dict, using only self.opt.
If doing batch submission of other App instances, must return a list of sink (final) BatchJob objects.

Reimplemented from MGT::App::App.

def MGT::Proj::HctApp::HctApp::exportTreeDynAnnot ( self )

Create a file in TreeDyn annotation format.
Example of this file format (taken from the TreeDyn web site):
BUD2  Subcellular_loc { Bud_neck Cytoskeletal } Cellular_Role { Cell_polarity } FuncCat { GTPase_activating_protein }
BUD3  Subcellular_loc { Bud_neck } Cellular_Role { Cell_polarity } FuncCat { Unknown }
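A minimal sketch of how one record in this format could be assembled from a label and a mapping of variables to values. The helper `format_treedyn_line` and its dict-based input are assumptions for illustration and are not part of HctApp; the spacing simply mirrors the example lines above.

```python
def format_treedyn_line(label, annotations):
    """Format one TreeDyn annotation record.

    annotations maps a variable name to a list of values,
    e.g. {"FuncCat": ["Unknown"]}. Hypothetical helper.
    """
    fields = []
    for key, values in annotations.items():
        # Each variable is written as: Name { value1 value2 ... }
        fields.append("%s { %s }" % (key, " ".join(values)))
    # Two spaces after the label, single spaces between variables,
    # as in the example taken from the TreeDyn web site.
    return label + "  " + " ".join(fields)


line = format_treedyn_line(
    "BUD3",
    {
        "Subcellular_loc": ["Bud_neck"],
        "Cellular_Role": ["Cell_polarity"],
        "FuncCat": ["Unknown"],
    },
)
print(line)
```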

def MGT::Proj::HctApp::HctApp::getDbSql ( self )

Allocate (if necessary) and return a connection to the SQL server.
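The pairing of getDbSql() and delDbSql() suggests a lazily allocated connection. The sketch below shows that pattern under stated assumptions: the class `DbHolder` is hypothetical, and `sqlite3` is used only as a stand-in backend so that the example runs anywhere; the real class may connect to a different server through a different wrapper.

```python
import sqlite3


class DbHolder:
    """Hypothetical stand-in illustrating lazy allocation and release."""

    def __init__(self, path=":memory:"):
        self.path = path
        self._db = None

    def getDbSql(self):
        # Open the connection only on first use, then reuse it.
        if self._db is None:
            self._db = sqlite3.connect(self.path)
        return self._db

    def delDbSql(self):
        # Release the connection; the next getDbSql() call reopens it.
        if self._db is not None:
            self._db.close()
            self._db = None


holder = DbHolder()
first = holder.getDbSql()
same = holder.getDbSql()      # reuses the open connection
holder.delDbSql()             # frees it before a long idle period
reopened = holder.getDbSql()  # allocates a fresh connection
```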

def MGT::Proj::HctApp::HctApp::loadAccToNameMem ( self )

Load dict(acc->name) from SQL table

def MGT::Proj::HctApp::HctApp::loadCvTreeNamesFile ( self )

Load names from a name file

def MGT::Proj::HctApp::HctApp::loadGenomicIdsOrgType	(	self,
		db,
		idGenSeq,
		inserterSeq,
		orgType
	)

Load Fasta deflines for genomic sequences, index them by accession, and drop all non-NC_ accessions and all plasmids.
@todo It might be more robust to parse the GenBank file, e.g.
FEATURES             Location/Qualifiers
     source          1..208369
                     /organism="Bacillus cereus ATCC 10987"
                     /mol_type="genomic DNA"
                     /strain="ATCC 10987"
                     /db_xref="ATCC:10987"
                     /db_xref="taxon:222523"
                     /plasmid="pBc10987"
     gene            join(207497..208369,1..687)

We would have to fix the gap(unk100) bug first, and also check how "extrachromosomal" is labeled
in the GB file.
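The defline-based filtering described for this method could look roughly like the sketch below. It assumes, purely for illustration, that each defline has the form ">ACCESSION description" and that plasmids can be recognized by the word "plasmid" in the description; the real deflines and the real test may differ. The function name and the accessions in the usage example are made up.

```python
def filter_genomic_deflines(deflines):
    """Index deflines by accession, keeping only NC_ non-plasmid records.

    Assumes each defline looks like ">ACCESSION free-text description".
    Hypothetical helper, not part of HctApp.
    """
    kept = {}
    for line in deflines:
        acc, _, descr = line.lstrip(">").strip().partition(" ")
        if not acc.startswith("NC_"):
            continue  # drop anything that is not an NC_ accession
        if "plasmid" in descr.lower():
            continue  # drop plasmids
        kept[acc] = descr
    return kept


# Made-up deflines for illustration only.
index = filter_genomic_deflines([
    ">NC_999991.1 Example organism, complete genome",
    ">NC_999992.1 Example organism plasmid pEx1, complete sequence",
    ">NZ_EXAMPLE01.1 Example draft contig",
])
print(index)
```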

def MGT::Proj::HctApp::HctApp::loadProtIdsOrgType	(	self,
		db,
		idGenSeq,
		inserterSeq,
		orgType
	)

Load Fasta deflines for protein sequences generated by pullSeq, parse and insert them into SQL table.

def MGT::Proj::HctApp::HctApp::loadSelProtIdsMem ( self )

Return the table created by subSampleProtIds() as a NumPy array.

def MGT::Proj::HctApp::HctApp::makeCvTreeSeqInput	(	self,
		accProts
	)

Create input FASTA files for CVTree by selecting only sequences with Acc from accProts.
Warning: this will erase existing content of the target directory.

def MGT::Proj::HctApp::HctApp::makeSeqGenNames ( self )

Generate an artificial mnemonic name for each genetic element.
The purpose of generating a new name is two-fold:
1. Phylip cannot handle names longer than 10 characters.
2. For each organism with multiple genetic elements, the name reflects the
nature of the element (chromosome, plasmid, ...) and numbers elements of
the same nature in order of decreasing size.
Examples:
If the organism is assigned id 123 and has only one chromosome,
that chromosome will be called t0123.
If that organism has two chromosomes, they will be called t0123C1 and t0123C2,
with C1 being the larger one.
The disparity between the one- and multi-chromosome naming schemes is designed to make
multi-chromosomal organisms visibly stand out.
Plasmids get names like t0123P1.
All other genetic elements get names like t0123?1, where ? is the first letter of the
genel field of the seq_gen table.
The names are saved into the SQL table name_gen.
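The naming rules can be sketched for a single organism as follows. This is an illustration under assumptions, not the actual implementation: the helper `make_gen_names` and its in-memory input are hypothetical, the zero-padding width of 4 is inferred from the t0123 example, and upper-casing the first letter of the element type is a guess based on the C and P examples.

```python
def make_gen_names(org_id, elements):
    """Name the genetic elements of one organism.

    elements is a list of (genel, length) pairs, where genel is a type
    string such as "chromosome" or "plasmid". Names are returned in
    input order. Hypothetical helper, not part of HctApp.
    """
    prefix = "t%04d" % org_id
    # Group element indices by the upper-cased first letter of their type.
    by_letter = {}
    for idx, (genel, _length) in enumerate(elements):
        by_letter.setdefault(genel[0].upper(), []).append(idx)
    names = [None] * len(elements)
    for letter, idxs in by_letter.items():
        if letter == "C" and len(idxs) == 1:
            # A single chromosome gets the bare organism name.
            names[idxs[0]] = prefix
            continue
        # Number elements of the same type from largest to smallest.
        ranked = sorted(idxs, key=lambda i: elements[i][1], reverse=True)
        for rank, i in enumerate(ranked, start=1):
            names[i] = "%s%s%d" % (prefix, letter, rank)
    return names


single = make_gen_names(123, [("chromosome", 5000)])
multi = make_gen_names(
    123, [("chromosome", 1000), ("chromosome", 5000), ("plasmid", 100)]
)
print(single, multi)
```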

def MGT::Proj::HctApp::HctApp::parseCmdLinePost	(	klass,
		options,
		args,
		parser
	)

Optionally modify options and args in-place.
Called at the end of parseCmdLine to allow derived classes to customize the option processing.
@param options options returned by OptionParser and converted to Struct object
@param args args returned by OptionParser
@param parser OptionParser object used to parse the command line - needed here to call its error() method
if necessary.
options should be modified in place by this method

Reimplemented from MGT::App::App.

def MGT::Proj::HctApp::HctApp::pullGenBankProts	(	self,
		orgType,
		idsGen
	)

Scan a set of GenBank protein files and output a multi-FASTA file with each protein as a separate record and chromosome encoded in defline.
The defline will look like:
>YP_089573|221988_NC_006300.1
as in ><protein_acc>|<taxid>_<chromosome_acc>
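A small sketch of building and parsing this defline format. The helper names `make_defline` and `parse_defline` are hypothetical. The parser relies on the taxid containing no underscore, so the part after "|" is split on the first underscore only, which leaves accessions such as NC_006300.1 intact.

```python
def make_defline(protein_acc, taxid, chromosome_acc):
    """Build a defline of the form >protein_acc|taxid_chromosome_acc."""
    return ">%s|%s_%s" % (protein_acc, taxid, chromosome_acc)


def parse_defline(defline):
    """Split a defline back into (protein_acc, taxid, chromosome_acc)."""
    body = defline.lstrip(">").strip()
    protein_acc, rest = body.split("|", 1)
    # Split on the first underscore only: the chromosome accession
    # itself contains one (e.g. NC_006300.1).
    taxid, chromosome_acc = rest.split("_", 1)
    return protein_acc, int(taxid), chromosome_acc


parsed = parse_defline(">YP_089573|221988_NC_006300.1")
rebuilt = make_defline(*parsed)
print(parsed, rebuilt)
```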

def MGT::Proj::HctApp::HctApp::pullGenBankProtsCatOld ( self )

Scan a set of GenBank protein files and output a multi-FASTA file with all proteins in one chromosome concatenated.
@todo We cannot insert spacers between proteins, because CvTree will not care. So we get chimeric k-mers.

def MGT::Proj::HctApp::HctApp::pullOutGroupProts ( self )

Convert RefSeq per-chromosome .faa file into .faa file with our defline format.
The defline will look like:
>YP_089573|221988_NC_006300.1
as in ><protein_acc>|<taxid>_<chromosome_acc>

def MGT::Proj::HctApp::HctApp::pullSeq ( self )

Convert original RefSeq Fasta and GenBank files into SQL tables and protein Fasta files

def MGT::Proj::HctApp::HctApp::runCvTree ( self )

Run the cvtree executable that generates the composition vectors.
Warning: this clears the existing content of cvVecDir.

def MGT::Proj::HctApp::HctApp::runCvTreeMat ( self )

Run cvtree script that builds the distance matrix from precomputed composition vectors.

def MGT::Proj::HctApp::HctApp::selectAnyGenElement	(	self,
		minProtSeqLen
	)

Select genomic elements of any type that are longer than the cutoff value for the distance matrix calculation.

def MGT::Proj::HctApp::HctApp::selectChromosomes	(	self,
		minProtSeqLen
	)

Select a subset of chromosomes for the distance matrix calculation.

def MGT::Proj::HctApp::HctApp::sqlPostPullSeq ( self )

Create various derived tables after pulling sequences

def MGT::Proj::HctApp::HctApp::subSampleProtIds	(	self,
		maxProtSeqLen
	)

Select a subset of proteins from each genomic sequence constrained by a sum of protein lengths.
The use case: we want to have each chromosome represented by an equal length sample, to make
sure that the phylogenetic tree we build is not influenced by length differences.
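One way such a length-constrained sample could be drawn for a single genomic sequence is sketched below. The random ordering and the greedy cumulative cutoff are assumptions made for illustration, not necessarily what subSampleProtIds() does; the function name, the `seed` parameter, and the in-memory dict input are hypothetical (the actual method works on SQL tables).

```python
import random


def sub_sample_proteins(prot_lens, max_total_len, seed=0):
    """Pick a random subset of proteins whose lengths sum to at most
    max_total_len.

    prot_lens maps protein id -> length for ONE genomic sequence.
    Hypothetical sketch, not part of HctApp.
    """
    rng = random.Random(seed)
    ids = sorted(prot_lens)  # fixed starting order for reproducibility
    rng.shuffle(ids)
    selected, total = [], 0
    for pid in ids:
        # Stop at the first protein that would exceed the length budget.
        if total + prot_lens[pid] > max_total_len:
            break
        selected.append(pid)
        total += prot_lens[pid]
    return selected


# Made-up protein lengths for illustration.
lens = {"a": 300, "b": 200, "c": 500, "d": 100}
sample = sub_sample_proteins(lens, 600)
print(sample, sum(lens[p] for p in sample))
```

Applying the same budget to every chromosome gives samples of comparable total length, which is the stated goal of the sub-sampling step.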

The documentation for this class was generated from the following file:

mgtaxa/MGT/Proj/HctApp.py

MGT::Proj::HctApp::HctApp Class Reference
