Inherits MGT::Config::MGTOptions.
Detailed Description
Collects data about taxonomically known sequence for training the classifier.
Processes the NCBI BLAST DB files by calling fastacmd.
Aggregates, prioritizes, and partially removes redundancy along the taxonomy ids.
Member Function Documentation
def MGT::CollectTaxa::TaxaCollector::createFullTextIndexSeqHeader(self)
Build a full-text index for sequence FASTA headers in the 'seq_hdr' table (currently implemented only for the MySQL back-end).
Warning: it took 1 hr for 28M records.
def MGT::CollectTaxa::TaxaCollector::delDuplicateGiFromSeq(self)
Delete duplicate records (by gi) from the seq table, leaving only those with the smallest gi.
Duplicate records appear because some sequences (with identical deflines) are included
in both the 'refseq_genomic' and 'other_genomic' BLAST databases.
def MGT::CollectTaxa::TaxaCollector::excludePostSource(self)
Exclude some sequence after selectTaxSource().
@post table act_seq is a mapping into seq and act_src.
@post table act_src is all_src with some records dropped and seq_len recomputed from seq_src.
The combination of these two new tables allows filtering the initial sequence set described by all_src
both at the individual sequence level (e.g. drop short sequences)
and at the source id level, meaning origin and sequence type
(e.g. retain only the longest RefSeq strain for each species).
Currently this method does not drop any sequence.
After this method, act_seq and act_src can be used
to randomly sample training and testing sets.
def MGT::CollectTaxa::TaxaCollector::excludePreSource(self)
Exclude some sequence before selectTaxSource().
@post table seq_excl holds the excluded idseq values.
We exclude records that are likely to be outliers in composition:
rRNA genes, plasmids and genomic islands.
Then we exclude all NT records that are not a complete genome or a complete chromosome.
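The two exclusion rules above can be illustrated with a small sketch. It assumes the decision is made by keyword matching on the FASTA defline, which the text does not state; the patterns, the function name is_excluded and the sample records are hypothetical.

```python
import re

# Hypothetical keyword patterns; the real method's criteria may differ.
OUTLIER_PATTERN = re.compile(
    r"\b(rRNA|ribosomal RNA|plasmid|genomic island)\b", re.IGNORECASE
)
COMPLETE_PATTERN = re.compile(r"complete (genome|chromosome)", re.IGNORECASE)


def is_excluded(defline, source):
    """Return True if a record should be listed in the exclusion table."""
    # Rule 1: likely compositional outliers are excluded from any source.
    if OUTLIER_PATTERN.search(defline):
        return True
    # Rule 2: NT records are kept only if they are complete genomes/chromosomes.
    if source == "nt" and not COMPLETE_PATTERN.search(defline):
        return True
    return False


# Made-up (idseq, source, defline) records for illustration only.
records = [
    (1, "nt", "Escherichia coli K-12 complete genome"),
    (2, "nt", "Escherichia coli 16S rRNA gene, partial sequence"),
    (3, "refseq", "Yersinia pestis plasmid pMT1, complete sequence"),
    (4, "nt", "Homo sapiens clone ABC mRNA"),
    (5, "refseq", "Bacillus subtilis chromosome, complete genome"),
]
seq_excl = [idseq for idseq, source, defline in records if is_excluded(defline, source)]
```

Here seq_excl plays the role of the seq_excl table: it collects the idseq values of the excluded records.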
def MGT::CollectTaxa::TaxaCollector::exportIdsForSeqDb(self, seqTable)
Write text files with sequence ids that will be used to create sequence HDF5 files.
Ids are extracted from the act_seq table.
Currently we load only NCBI sequence, but we want to decouple SeqDB from NCBI GIs
in case we also use non-NCBI sequence in the future. Therefore, we write
a file with (GI,ID) pairs where ID is our internal sequence id. The order is defined by
our tree nested set index. This order most closely corresponds to the order in which
we will be traversing the tree most of the time.
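As a rough illustration of the (GI,ID) export, the sketch below writes tab-separated pairs ordered by a nested set index. The tuple layout, the name nested_left, the tab-separated format and the tie-breaking by internal id are all assumptions, not the actual file format used by this method.

```python
import csv
import io


def write_gi_id_pairs(rows, out):
    """Write (GI, ID) pairs ordered by the nested-set index of their taxon.

    rows: iterable of (gi, internal_id, nested_left) tuples, where nested_left
    is a hypothetical name for the left nested-set index of the record's taxon.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Sort by tree position first, then by internal id to make the order stable.
    for gi, internal_id, _ in sorted(rows, key=lambda r: (r[2], r[1])):
        writer.writerow([gi, internal_id])


# Made-up rows for illustration only.
rows = [(9001, 3, 12), (9002, 1, 4), (9003, 2, 4)]
buf = io.StringIO()
write_gi_id_pairs(rows, buf)
exported = buf.getvalue()
```

Keeping the internal id in the second column means a later non-NCBI source would only need to supply its own first-column identifier.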
def MGT::CollectTaxa::TaxaCollector::indexHdfActiveSeq(self)
Create HDF index dataset for active sequence (from act_seq table).
def MGT::CollectTaxa::TaxaCollector::indexHdfSeq(self)
Create SQL tables that index sequence in HDF dataset.
def MGT::CollectTaxa::TaxaCollector::loadSeqToHdf(self)
Load sequence data for all records in 'seq' table into HDF dataset.
def MGT::CollectTaxa::TaxaCollector::loadTaxNames(self)
We load only 'scientific name' entries.
def MGT::CollectTaxa::TaxaCollector::loadTaxTables(self)
Create and fill all initial tables with taxa tree data.
It only needs NCBI taxonomy dump files.
It assigns dummy zero values to seq_len and seq_len_total attributes.
Result: the original tree dump files, as well as our tree object with nested set index etc.,
are saved in the SQL DB.
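For readers unfamiliar with a nested set index, the sketch below shows one common way to compute it from child-to-parent pairs such as those in a taxonomy dump. It is a generic illustration, not the code of this class; the function name build_nested_set and the toy tree are hypothetical.

```python
from collections import defaultdict


def build_nested_set(parent_of, root):
    """Assign (left, right) nested-set indices by depth-first traversal.

    parent_of maps each taxid to its parent taxid; the root maps to itself.
    """
    children = defaultdict(list)
    for taxid, parent in parent_of.items():
        if taxid != root:
            children[parent].append(taxid)
    index = {}
    counter = 0
    # Iterative traversal avoids the recursion limit on deep trees.
    stack = [(root, False)]
    while stack:
        taxid, visited = stack.pop()
        counter += 1
        if visited:
            # Second visit: all descendants are numbered, so close the interval.
            index[taxid] = (index[taxid], counter)
        else:
            # First visit: record the left index and schedule the closing visit.
            index[taxid] = counter
            stack.append((taxid, True))
            for child in sorted(children[taxid], reverse=True):
                stack.append((child, False))
    return index


# Toy tree: 1 is the root, 2 and 3 are its children, 4 and 5 are children of 2.
parent_of = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2}
nested = build_nested_set(parent_of, root=1)
```

With this numbering, the interval of every node lies inside the interval of each of its ancestors, so a whole subtree can be selected with a single range condition in SQL.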
def MGT::CollectTaxa::TaxaCollector::reportStat(self)
Output a report that shows a high-level overview of data.
def MGT::CollectTaxa::TaxaCollector::selectActiveSeq(self)
Select 'active' source sequence set - the one that can be used for training.
@post tables act_seq and act_src contain usable training sequence.
def MGT::CollectTaxa::TaxaCollector::selectSeqBySource(self)
For each taxid, group available sequence by its source (refseq, wgs, htgs, nt) and select groups by priority.
@post table all_src has all selected groups and a synthetic primary key 'id' for each group.
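One possible reading of the grouping step is sketched below: sequence is summed per (taxid, source) group, and for each taxid only the highest-priority group is kept. The priority order, the keep-one-group rule and all names here are assumptions; the actual method may rank or combine groups differently.

```python
from collections import defaultdict

# Assumed priority order (earlier is preferred), taken from the order in which
# the sources are listed in the description; this is not confirmed.
SOURCE_PRIORITY = ["refseq", "wgs", "htgs", "nt"]


def select_by_source(seq_rows):
    """Return rows of (id, taxid, source, seq_len), one per selected group."""
    # Sum sequence length within each (taxid, source) group.
    groups = defaultdict(int)
    for taxid, source, seq_len in seq_rows:
        groups[(taxid, source)] += seq_len
    # For each taxid keep the group whose source has the best (lowest) rank.
    best = {}
    for (taxid, source), total in groups.items():
        rank = SOURCE_PRIORITY.index(source)
        if taxid not in best or rank < best[taxid][0]:
            best[taxid] = (rank, source, total)
    # The synthetic primary key 'id' numbers the selected groups from 1.
    return [
        (new_id, taxid, source, total)
        for new_id, (taxid, (_, source, total)) in enumerate(sorted(best.items()), start=1)
    ]


# Made-up (taxid, source, seq_len) rows for illustration only.
seq_rows = [
    (562, "nt", 1000),
    (562, "refseq", 4600000),
    (562, "refseq", 5000),
    (1280, "wgs", 2800000),
    (1280, "htgs", 90000),
]
all_src = select_by_source(seq_rows)
```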
Member Data Documentation
We exclude records that are in subtrees of taxidDrop taxids, have undefined taxonomy fields, or have a divid from the dividDrop list.
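The exclusion rule can be sketched with nested set intervals, where a subtree test is a simple range comparison. The data structures, the function names and the sample values below are hypothetical, and treating a missing interval or missing divid as "undefined taxonomy fields" is an assumption.

```python
def in_subtree(node, ancestor):
    """Nested-set containment: node lies in the subtree rooted at ancestor."""
    return ancestor[0] <= node[0] and node[1] <= ancestor[1]


def keep_taxon(taxid, nested, divid_of, taxid_drop, divid_drop):
    """Return True if records of this taxid survive the exclusion rules."""
    divid = divid_of.get(taxid)
    # Assumption: a missing interval or division id counts as undefined taxonomy.
    if taxid not in nested or divid is None:
        return False
    if divid in divid_drop:
        return False
    return not any(in_subtree(nested[taxid], nested[t]) for t in taxid_drop)


# Made-up nested-set intervals and division ids for a six-node tree.
nested = {1: (1, 12), 2: (2, 7), 3: (8, 11), 4: (3, 4), 5: (5, 6), 6: (9, 10)}
divid_of = {1: 8, 2: 0, 3: 0, 4: 0, 5: 0, 6: 11}
taxid_drop = {2}
divid_drop = {11}
kept = [t for t in sorted(nested) if keep_taxon(t, nested, divid_of, taxid_drop, divid_drop)]
```

In this toy data, taxids 2, 4 and 5 fall under the dropped subtree and taxid 6 belongs to a dropped division, so only 1 and 3 remain.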
The documentation for this class was generated from the following file:
- mgtaxa/MGT/CollectTaxa.py