State machine for k-mers along with a corresponding payload object implemented in KmerStates and KmerCounter classes respectively. More...
#include <kmers.hpp>
Public Member Functions | |
KmerStates (int kmerLen, const AbcConvCharToInt *pAbcConv, RC_POLICY revCompPolicy=RC_MERGE, KmerId firstIdState=1) | |
Constructor. | |
PKmerState | nextState (PKmerState pState, INuc c) |
Return pointer to the next KmerState in the state machine. | |
PKmerState | revCompState (PKmerState pState) |
Return pointer to the reverse complement KmerState. | |
PKmerState | revCompStateFirst (PKmerState pState) |
Return the state that we marked as "first" in a pair of mutually reverse-complement states. | |
bool | isRevComp (PKmerState pState) const |
Return true if this state is marked as reverse-complement. | |
bool | isDegenState (PKmerState pState) const |
Return true if a given state corresponds to a degenerate k-mer. | |
PKmerState | firstState () |
Return pointer to the very first KmerState in the state machine. | |
const PKmerState | firstState () const |
Return constant pointer to the very first KmerState in the state machine. | |
int | numStates () const |
Return total number of KmerState states in the state machine (normal and degenerate). | |
KmerId | idState (PKmerState pState) const |
Return unique ID of the state's k-mer. | |
PKmerStateData | getData (PKmerState pState) |
Return pointer to KmerStateData object linked to the given state. | |
PKmerStateData | getData (Ind ind) |
Return pointer to KmerStateData object linked to the state with a given index. | |
void | setData (PKmerState pState, PKmerStateData pStateData) |
Link given KmerState object to the given KmerStateData object. | |
const Kmer & | kmerState (PKmerState pState) const |
Return reference to Kmer object for a given KmerState object. | |
Ind | indState (const PKmerState pState) const |
Return index of a given KmerState object that can be used to access this object later with getState(). | |
const PKmerState | getState (Ind ind) const |
Return pointer to a KmerState object given its index obtained earlier with indState(). | |
const Kmer & | getKmer (Ind ind) const |
Return reference to a Kmer object given its index obtained earlier with indState(). | |
KmerId | getFirstIdState () const |
Return first state ID (for non-degenerate k-mers). | |
KmerId | getLastIdState () const |
Return last used plus one state ID (for non-degenerate k-mers). | |
int | getNumIds () const |
Get the total number of different IDs ("number of features") - depends on the RC_POLICY. | |
std::ostream & | print (std::ostream &out) const |
Print KmerStates object for debugging. | |
Ind | kmerToIndex (const Kmer &kmer) const |
Return position of Kmer object in internal array. | |
Protected Member Functions | |
PKmerState | kmerToState (const Kmer &kmer) const |
Return pointer to KmerState for a given Kmer object. | |
void | nextKmer (const Kmer &currKmer, INuc c, Kmer &nextKmer) const |
Fill next Kmer object based on current Kmer and incoming nucleotide code. | |
void | initAllKmers () |
Initialize all KmerState and Kmer objects. | |
void | initRevCompl () |
Initialize reverse-complement links in KmerState objects. | |
Protected Attributes | |
int | iFirstStateNonDegen |
Index of the first non-degenerate state in m_states. | |
KmerId | m_firstIdState |
IDs of output (non-degenerate, non-reverse-complement) k-mers fill a half open interval [m_firstIdState,m_lastIdState). |
State machine for k-mers along with a corresponding payload object implemented in KmerStates and KmerCounter classes respectively.
The current implementation is optimized for the following use case:
In view of 2, we will keep the "payload" for each state node separately, referenced by the pointer to an instance of KmerStateData class. The payload object will contain the back reference to its state and the count, and the payloads can be allocated as needed, from a pre-allocated array. The size of payload array is bound by the max chunk length x 2 (direct plus reverse-complement sequence lengths), or the total number of possible k-mers for a given k, whichever is less. Therefore, we pre-allocate storage for the payload array of with the total number of possible k-mers.
The current implementation uses dense k-mer state machine (all possible k-mers are preallocated and states are initialized) with sparse payload: all states have initially their payloads set to NULL. As we stream through the input sequence, we move from the current state to the next one. At each step, we take the first of the reverse complement pair of states, and check if its payload is NULL. If it is, we allocate the new payload object. Then, we increment the count in the payload object.
Any implementation of KmerStates must guarantee this: All degenerate states point to the very first (zero) state as their reverse-complement state. That allows KmerCounter to link zero state to a single dummy KmerStateData object. Thus, reverse-complement link is not symmetric for degenerate states.
When counts are to be extracted, we have to loop only through the payload objects for those k-mers that were actually seen and represent the non-redundant subset of all reverse-complement pairs.
After construction, this implementation guarantees the following complexity during the sequence processing: O(1) on k-mer length; O(min(L,N)) where L is the actual sequence chunk length and N is the number of possible k-mers + O(min(L,N)*log(min(L,N))) if results are sorted by k-mer ID for extraction.
MGT::KmerStates::KmerStates | ( | int | kmerLen, |
const AbcConvCharToInt * | pAbcConv, | ||
RC_POLICY | revCompPolicy = RC_MERGE , |
||
KmerId | firstIdState = 1 |
||
) |
Constructor.
kmerLen | - length of k-mers |
pAbcConv | - pointer to AbcConvCharToInt alphabet convertor object (stored inside this KmerStates object but not managed). |
firstIdState | - start k-mer ids from this value. |
PKmerState MGT::KmerStates::firstState | ( | ) | [inline] |
Return pointer to the very first KmerState in the state machine.
Each k-mer count accumulation cycle starts from this state.
const PKmerState MGT::KmerStates::firstState | ( | ) | const [inline] |
Return constant pointer to the very first KmerState in the state machine.
Each k-mer count accumulation cycle starts from this state.
PKmerStateData MGT::KmerStates::getData | ( | PKmerState | pState ) | [inline] |
Return pointer to KmerStateData object linked to the given state.
pState | - current state |
PKmerStateData MGT::KmerStates::getData | ( | Ind | ind ) | [inline] |
Return pointer to KmerStateData object linked to the state with a given index.
ind | - index of the state |
KmerId MGT::KmerStates::getFirstIdState | ( | ) | const [inline] |
Return first state ID (for non-degenerate k-mers).
const Kmer & MGT::KmerStates::getKmer | ( | Ind | ind ) | const [inline] |
Return reference to a Kmer object given its index obtained earlier with indState().
ind | Index of the state This is done in constant time by indexing the internal array. |
KmerId MGT::KmerStates::getLastIdState | ( | ) | const [inline] |
Return last used plus one state ID (for non-degenerate k-mers).
int MGT::KmerStates::getNumIds | ( | ) | const [inline] |
Get the total number of different IDs ("number of features") - depends on the RC_POLICY.
const PKmerState MGT::KmerStates::getState | ( | Ind | ind ) | const [inline] |
Return pointer to a KmerState object given its index obtained earlier with indState().
ind | Index of the state This is done in constant time by indexing the internal array. |
KmerId MGT::KmerStates::idState | ( | PKmerState | pState ) | const [inline] |
Return unique ID of the state's k-mer.
This is stable between invocations of the program. IDs fill the half-open interval [getFirstIdState(),getLastIdState()) (for the first non-degenerate k-mer).
Ind MGT::KmerStates::indState | ( | const PKmerState | pState ) | const [inline] |
Return index of a given KmerState object that can be used to access this object later with getState().
pState | - pointer to state This is done in constant time based on position of pState in internal array. |
void MGT::KmerStates::initRevCompl | ( | ) | [protected] |
Initialize reverse-complement links in KmerState objects.
bool MGT::KmerStates::isDegenState | ( | PKmerState | pState ) | const [inline] |
Return true if a given state corresponds to a degenerate k-mer.
bool MGT::KmerStates::isRevComp | ( | PKmerState | pState ) | const [inline] |
Return true if this state is marked as reverse-complement.
This method guarantees that only one of the two mutually reverse-complement states is marked as such, and that it is always the same one between different invocations of the program.
const Kmer & MGT::KmerStates::kmerState | ( | PKmerState | pState ) | const [inline] |
Ind MGT::KmerStates::kmerToIndex | ( | const Kmer & | kmer ) | const [inline] |
Return position of Kmer object in internal array.
kmer | - Kmer object This is done based on value of kmer, and complexity is linear in kmer length. This is made public for the use of implementation-aware code such as KmerCounterLadder. |
PKmerState MGT::KmerStates::kmerToState | ( | const Kmer & | kmer ) | const [inline, protected] |
PKmerState MGT::KmerStates::nextState | ( | PKmerState | pState, |
INuc | c | ||
) | [inline] |
Return pointer to the next KmerState in the state machine.
pState | - current state |
c | - nucleotide integer code |
int MGT::KmerStates::numStates | ( | ) | const [inline] |
Return total number of KmerState states in the state machine (normal and degenerate).
std::ostream & MGT::KmerStates::print | ( | std::ostream & | out ) | const |
Print KmerStates object for debugging.
PKmerState MGT::KmerStates::revCompState | ( | PKmerState | pState ) | [inline] |
Return pointer to the reverse complement KmerState.
pState | - current state |
PKmerState MGT::KmerStates::revCompStateFirst | ( | PKmerState | pState ) | [inline] |
void MGT::KmerStates::setData | ( | PKmerState | pState, |
PKmerStateData | pStateData | ||
) | [inline] |
Link given KmerState object to the given KmerStateData object.
This sets only uni-directed link from KmerState to KmerStateData.
pState | - pointer to state |
pStateData | - pointer to data |
int MGT::KmerStates::iFirstStateNonDegen [protected] |
Index of the first non-degenerate state in m_states.
KmerId MGT::KmerStates::m_firstIdState [protected] |
IDs of output (non-degenerate, non-reverse-complement) k-mers fill a half open interval [m_firstIdState,m_lastIdState).