Class that computes k-mer counts of incoming nucleotide sequence.
#include <kmers.hpp>
Public Member Functions | |
KmerCounter (int kmerLen, const AbcConvCharToInt *pAbcConv=0, RC_POLICY revCompPolicy=RC_MERGE, KmerId firstIdState=1) | |
A constructor. | |
void | doCNuc (CNuc cnuc) |
This method is called to process the input sequence. | |
void | doINuc (INuc inuc) |
This method is called for each element of an input sequence (through KmerCounter::doCNuc() wrapper method). | |
int | maxNumKmers (ULong seqLen) const |
Return the maximum number of unique k-mers that can be found in a sequence of a given length. | |
KmerId | getFirstIdState () const |
Return first state ID (for non-degenerate k-mers). | |
KmerId | getLastIdState () const |
Return one past the last used state ID (for non-degenerate k-mers). | |
int | getNumIds () const |
Get the total number of different IDs ("number of features") - depends on the RC_POLICY. | |
KmerStates & | getStates () |
Get reference to internal KmerStates array. | |
Interface to extract the results. | |
The following set of methods defines the result extraction protocol. They must be called in a strict order because they change the internal state of the KmerCounter object. The interface is designed for efficiency and flexibility of the caller. After doCNuc() has been called any number of times (we call this an "accumulation cycle"), the extraction is done as in the following code sample, after which doCNuc() can be called again to accumulate new counts:
    int n = o.numKmers();
    o.startKmer();
    for(int i = 0; i < n; i++,o.nextKmer()) {
        cout << o.getKmerId() << ":" << o.getKmerCount() << "\n";
    }
    o.finishKmer();
nextKmer() can be called fewer than numKmers() times. sumKmerCounts() and sumDegenKmerCounts() can be called only outside of a startKmer()...finishKmer() block. | |
int | numKmers () const |
Return number of k-mers found so far in current accumulation cycle. | |
ULong | sumKmerCounts () const |
Return sum of non-degenerate k-mer counts found so far in current accumulation cycle. | |
ULong | sumDegenKmerCounts () const |
Return sum of degenerate k-mer counts found so far in current accumulation cycle. | |
void | startKmer (bool doSort=true) |
Prepare internal state for result extraction. | |
void | nextKmer () |
Advance internal state to extract next k-mer results. | |
ULong | getKmerCount () const |
Accessor to get count value from the currently extracted k-mer. | |
int | getKmerId () const |
Accessor to get Id from the currently extracted k-mer. | |
std::string | getKmerStr () const |
Accessor to get k-mer string (such as 'ACCCT') for the currently extracted k-mer. | |
Ind | indState () const |
Accessor to get index of the state for the currently extracted k-mer. | |
const PKmerState | getState () const |
Accessor to get pointer to the state for the currently extracted k-mer. | |
void | finishKmer () |
Finalize result extraction cycle - new series of doCNuc() calls can be done afterwards. | |
Protected Attributes | |
RC_POLICY | m_revCompPolicy |
How reverse complements are treated - one of RC_XXX. | |
std::vector< KmerStateData > | m_data |
Preallocated array of KmerStateData objects. | |
KmerStateData | m_dataDegen |
One dummy KmerStateData object that is linked to the zero state, which in turn is set as a reverse-complement one for all other degenerate states. | |
int | m_iDataEnd |
Index of the first unused element in m_data. | |
int | m_iDataExtr |
Index of m_data element that is currently being extracted. |
Class that computes k-mer counts of incoming nucleotide sequence.
It processes the input sequence and extracts the counts. Internally, it maintains the KmerStateData payload data and moves through states of the KmerStates state machine in response to incoming nucleotides.
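The idea of moving through k-mer states one nucleotide at a time can be illustrated with a minimal sketch. It assumes a 2-bit nucleotide encoding and a rolling integer index; `encodeNuc`, `RollingKmerIndex` and `countKmers` are hypothetical names that are not part of MGT, and the actual KmerStates structure may be organized differently (for example, with precomputed transition tables).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: map a nucleotide character to 0..3, or -1 if degenerate.
inline int encodeNuc(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;  // 'N' or any other ambiguity code
    }
}

// Hypothetical illustration of a k-mer "state" that advances per nucleotide.
class RollingKmerIndex {
public:
    explicit RollingKmerIndex(int k)
        : m_k(k), m_mask((std::uint64_t(1) << (2 * k)) - 1), m_state(0), m_valid(0) {
        assert(k > 0 && k < 32);
    }
    // Feed one nucleotide; returns true when a complete non-degenerate k-mer is available.
    bool push(char c) {
        int code = encodeNuc(c);
        if (code < 0) {  // a degenerate symbol restarts the window
            m_valid = 0;
            m_state = 0;
            return false;
        }
        // Drop the oldest nucleotide, append the newest one.
        m_state = ((m_state << 2) | std::uint64_t(code)) & m_mask;
        if (m_valid < m_k) ++m_valid;
        return m_valid == m_k;
    }
    std::uint64_t index() const { return m_state; }
private:
    int m_k;
    std::uint64_t m_mask;
    std::uint64_t m_state;
    int m_valid;
};

// Count all non-degenerate k-mers of a sequence into a dense table of 4^k entries.
inline std::vector<unsigned long> countKmers(const std::string& seq, int k) {
    assert(k > 0 && k <= 12);  // keep the dense table small in this sketch
    std::vector<unsigned long> counts(std::size_t(1) << (2 * k), 0);
    RollingKmerIndex st(k);
    for (char c : seq) {
        if (st.push(c)) ++counts[st.index()];
    }
    return counts;
}
```

In this sketch, `countKmers("ACGTAC", 2)` fills a 16-entry table in which the slot for "AC" (index 1) holds 2. No reverse-complement merging is attempted here.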
MGT::KmerCounter::KmerCounter | ( | int | kmerLen, |
const AbcConvCharToInt * | pAbcConv = 0, |
RC_POLICY | revCompPolicy = RC_MERGE, |
KmerId | firstIdState = 1 |
) |
A constructor.
kmerLen | Length of a k-mer. In the current implementation, all k-mers are precalculated and stored in memory, and their number grows exponentially with kmerLen (4^kmerLen for a four-letter nucleotide alphabet), so be reasonable with this parameter. |
pAbcConv | Pointer to an AbcConvCharToInt alphabet converter (the pointer is stored inside this KmerCounter object, but the converter is not owned or managed by it). |
revCompPolicy | What to do about reverse-complement k-mers. |
firstIdState | Start k-mer IDs from this value (default 1 as in SVMLight) |
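To make the warning about kmerLen concrete, the following sketch estimates how a precalculated table grows. It assumes a four-letter alphabet and ignores degenerate states; `numPossibleKmers`, `estimateTableBytes` and the per-k-mer payload size are hypothetical and are not taken from the MGT sources.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct non-degenerate k-mers over {A,C,G,T}: 4^k.
inline std::uint64_t numPossibleKmers(int k) {
    assert(k >= 0 && k < 32);
    return std::uint64_t(1) << (2 * k);
}

// Rough lower bound on table memory for an assumed payload size per k-mer.
inline std::uint64_t estimateTableBytes(int k, std::uint64_t bytesPerKmer) {
    return numPossibleKmers(k) * bytesPerKmer;
}
```

For example, with an assumed 16 bytes per k-mer, k = 8 needs about 1 MiB, while k = 12 already needs 256 MiB.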
void MGT::KmerCounter::doCNuc | ( | CNuc | cnuc ) | [inline] |
This method is called to process the input sequence.
Series of calls to this method are interleaved with calls to result extraction methods.
cnuc | one nucleotide character value (such as 'A') |
void MGT::KmerCounter::finishKmer | ( | ) | [inline] |
Finalize result extraction cycle - new series of doCNuc() calls can be done afterwards.
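The accumulate / extract / finish cycle can be mimicked with a small stand-in class, sketched below. `ToyCounter` is hypothetical and only borrows the method names of the extraction protocol; in particular, clearing the counts in finishKmer() is an assumption about what starting a new accumulation cycle means, not a statement about the KmerCounter implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in that mimics the protocol order; not the MGT class.
class ToyCounter {
public:
    // Accumulation step: record one observation of a k-mer id.
    void add(int id) {
        assert(!m_extracting);  // no accumulation while extracting
        for (auto& e : m_found) {
            if (e.first == id) { ++e.second; return; }
        }
        m_found.emplace_back(id, 1UL);
    }
    int numKmers() const { return static_cast<int>(m_found.size()); }
    void startKmer(bool doSort = true) {
        assert(!m_extracting);
        if (doSort) std::sort(m_found.begin(), m_found.end());  // orders by id
        m_pos = 0;
        m_extracting = true;
    }
    void nextKmer() {
        assert(m_extracting);
        ++m_pos;
    }
    int getKmerId() const {
        assert(m_extracting && m_pos < m_found.size());
        return m_found[m_pos].first;
    }
    unsigned long getKmerCount() const {
        assert(m_extracting && m_pos < m_found.size());
        return m_found[m_pos].second;
    }
    // Assumption: finishing clears the counts so the next cycle starts empty.
    void finishKmer() {
        assert(m_extracting);
        m_found.clear();
        m_extracting = false;
    }
private:
    std::vector<std::pair<int, unsigned long>> m_found;
    std::size_t m_pos = 0;
    bool m_extracting = false;
};
```

The assertions on `m_extracting` show why the call order matters: extraction accessors are only meaningful between startKmer() and finishKmer().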
KmerId MGT::KmerCounter::getFirstIdState | ( | ) | const [inline] |
Return first state ID (for non-degenerate k-mers).
ULong MGT::KmerCounter::getKmerCount | ( | ) | const [inline] |
Accessor to get count value from the currently extracted k-mer.
int MGT::KmerCounter::getKmerId | ( | ) | const [inline] |
Accessor to get Id from the currently extracted k-mer.
std::string MGT::KmerCounter::getKmerStr | ( | ) | const [inline] |
Accessor to get k-mer string (such as 'ACCCT') for the currently extracted k-mer.
KmerId MGT::KmerCounter::getLastIdState | ( | ) | const [inline] |
Return one past the last used state ID (for non-degenerate k-mers).
int MGT::KmerCounter::getNumIds | ( | ) | const [inline] |
Get the total number of different IDs ("number of features") - depends on the RC_POLICY.
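As an illustration of why the feature count depends on the reverse-complement policy, the sketch below counts k-mers when each k-mer is merged with its reverse complement. It assumes that RC_MERGE collapses such pairs into one feature; `numCanonicalKmers` is a hypothetical helper, and whether getNumIds() returns exactly this value is not confirmed by this page.

```cpp
#include <cassert>
#include <cstdint>

// Number of k-mer classes when a k-mer and its reverse complement are merged.
// Odd k:  no k-mer equals its own reverse complement, so the count is 4^k / 2.
// Even k: 4^(k/2) k-mers are their own reverse complement, so the count is
//         (4^k + 4^(k/2)) / 2.
inline std::uint64_t numCanonicalKmers(int k) {
    assert(k > 0 && k < 32);
    const std::uint64_t all = std::uint64_t(1) << (2 * k);  // 4^k
    if (k % 2 != 0) return all / 2;
    const std::uint64_t selfRc = std::uint64_t(1) << k;     // 4^(k/2) = 2^k
    return (all + selfRc) / 2;
}
```

For k = 2 this gives 10 classes: the 16 dinucleotides include 4 that are their own reverse complement (AT, TA, CG, GC), and the remaining 12 form 6 pairs.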
const PKmerState MGT::KmerCounter::getState | ( | ) | const [inline] |
Accessor to get pointer to the state for the currently extracted k-mer.
Should be used only by implementation-aware code such as KmerCounterLadder.
KmerStates& MGT::KmerCounter::getStates | ( | ) | [inline] |
Get reference to internal KmerStates array.
Declared public only to be used by implementation-aware code such as KmerCounterLadder.
Ind MGT::KmerCounter::indState | ( | ) | const [inline] |
Accessor to get index of the state for the currently extracted k-mer.
Should be used only by implementation-aware code such as KmerCounterLadder.
int MGT::KmerCounter::maxNumKmers | ( | ULong | seqLen ) | const [inline] |
Return the maximum number of unique k-mers that can be found in a sequence of a given length.
seqLen | sequence length |
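One plausible way to bound the number of unique k-mers is sketched below: a sequence of length seqLen contains at most seqLen - k + 1 windows, and there are at most 4^k distinct non-degenerate k-mers. `maxDistinctKmers` is a hypothetical helper; the formula actually used by maxNumKmers() is not shown on this page and may differ, for example by accounting for reverse-complement merging.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Upper bound on distinct k-mers in a sequence: min(number of windows, 4^k).
inline std::uint64_t maxDistinctKmers(std::uint64_t seqLen, int k) {
    assert(k > 0 && k < 32);
    if (seqLen < std::uint64_t(k)) return 0;  // sequence shorter than one k-mer
    const std::uint64_t windows = seqLen - std::uint64_t(k) + 1;
    const std::uint64_t possible = std::uint64_t(1) << (2 * k);
    return std::min(windows, possible);
}
```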
void MGT::KmerCounter::nextKmer | ( | ) | [inline] |
Advance internal state to extract next k-mer results.
int MGT::KmerCounter::numKmers | ( | ) | const [inline] |
Return number of k-mers found so far in current accumulation cycle.
void MGT::KmerCounter::startKmer | ( | bool | doSort = true ) |
[inline] |
Prepare internal state for result extraction.
doSort | If true, the results will be sorted (complexity will be N*log(N), where N is numKmers()). The SVM sparse feature vector representation needs sorted results. |
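The need for sorted results can be seen in a minimal sketch that writes (id, count) pairs as one line of a sparse feature vector in the SVMLight style, where feature IDs must appear in increasing order. `IdCount` and `toSparseLine` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using IdCount = std::pair<int, unsigned long>;  // (feature id, count)

// Format features as "<label> id:count id:count ..." with ids in increasing order.
inline std::string toSparseLine(std::vector<IdCount> feats, const std::string& label) {
    std::sort(feats.begin(), feats.end(),
              [](const IdCount& a, const IdCount& b) { return a.first < b.first; });
    std::ostringstream out;
    out << label;
    for (const IdCount& f : feats) {
        out << ' ' << f.first << ':' << f.second;
    }
    return out.str();
}
```

A caller that does not need ordered output, such as one that only sums the counts, could skip the sort and avoid the N*log(N) cost.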
ULong MGT::KmerCounter::sumDegenKmerCounts | ( | ) | const [inline] |
Return sum of degenerate k-mer counts found so far in current accumulation cycle.
Can be only called outside of startKmer()...finishKmer() block. Complexity: constant time
ULong MGT::KmerCounter::sumKmerCounts | ( | ) | const [inline] |
Return sum of non-degenerate k-mer counts found so far in current accumulation cycle.
Can be called only outside of a startKmer()...finishKmer() block. Complexity: linear in the number of found k-mers.
std::vector<KmerStateData> MGT::KmerCounter::m_data [protected] |
Preallocated array of KmerStateData objects.
Array size is equal to the total number of states. This guarantees that there are always enough data elements regardless of the input sequence length.
KmerStateData MGT::KmerCounter::m_dataDegen [protected] |
One dummy KmerStateData object that is linked to the zero state, which in turn is set as a reverse-complement one for all other degenerate states.
This way, it serves as a sink counter for all degenerate states. That in turn removes one branch condition from the time-critical code in doINuc(). We also query from it the total count of degenerate states in the last accumulation cycle.
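The sink-counter technique can be sketched as follows: every state is mapped in advance to a counter slot, and all degenerate states share slot 0, so the hot path increments a counter without testing whether the state is degenerate. `SinkCounters` and its members are hypothetical and only illustrate the idea; the actual linkage through the zero state and reverse-complement pointers in KmerCounter is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration: slot 0 is a shared sink for all degenerate states.
struct SinkCounters {
    std::vector<std::size_t> slotOf;      // state index -> counter slot
    std::vector<unsigned long> counters;  // counters[0] is the sink

    explicit SinkCounters(const std::vector<bool>& isDegenerate)
        : slotOf(isDegenerate.size(), 0), counters(1, 0) {
        // Precompute the mapping once, outside the time-critical loop.
        for (std::size_t s = 0; s < isDegenerate.size(); ++s) {
            if (!isDegenerate[s]) {
                slotOf[s] = counters.size();
                counters.push_back(0);
            }
        }
    }
    // Hot path: no "is this state degenerate?" branch.
    void hit(std::size_t state) { ++counters[slotOf[state]]; }

    unsigned long degenCount() const { return counters[0]; }
    unsigned long count(std::size_t state) const { return counters[slotOf[state]]; }
};
```

Reading `degenCount()` then takes constant time, which is consistent with the constant-time complexity documented for sumDegenKmerCounts().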
int MGT::KmerCounter::m_iDataEnd [protected] |
Index of the first unused element in m_data.
int MGT::KmerCounter::m_iDataExtr [protected] |
Index of m_data element that is currently being extracted.
RC_POLICY MGT::KmerCounter::m_revCompPolicy [protected] |
How reverse complements are treated - one of RC_XXX.