Public Member Functions | Protected Member Functions | Protected Attributes

MGT::KmerStates Class Reference

State machine for k-mers along with a corresponding payload object implemented in KmerStates and KmerCounter classes respectively. More...

#include <kmers.hpp>

List of all members.

Public Member Functions

 KmerStates (int kmerLen, const AbcConvCharToInt *pAbcConv, RC_POLICY revCompPolicy=RC_MERGE, KmerId firstIdState=1)
 Constructor.
PKmerState nextState (PKmerState pState, INuc c)
 Return pointer to the next KmerState in the state machine.
PKmerState revCompState (PKmerState pState)
 Return pointer to the reverse complement KmerState.
PKmerState revCompStateFirst (PKmerState pState)
 Return the state that we marked as "first" in a pair of mutually reverse-complement states.
bool isRevComp (PKmerState pState) const
 Return true if this state is marked as reverse-complement.
bool isDegenState (PKmerState pState) const
 Return true if a given state corresponds to a degenerate k-mer.
PKmerState firstState ()
 Return pointer to the very first KmerState in the state machine.
const PKmerState firstState () const
 Return constant pointer to the very first KmerState in the state machine.
int numStates () const
 Return total number of KmerState states in the state machine (normal and degenerate).
KmerId idState (PKmerState pState) const
 Return unique ID of the state's k-mer.
PKmerStateData getData (PKmerState pState)
 Return pointer to KmerStateData object linked to the given state.
PKmerStateData getData (Ind ind)
 Return pointer to KmerStateData object linked to the state with a given index.
void setData (PKmerState pState, PKmerStateData pStateData)
 Link given KmerState object to the given KmerStateData object.
const KmerkmerState (PKmerState pState) const
 Return reference to Kmer object for a given KmerState object.
Ind indState (const PKmerState pState) const
 Return index of a given KmerState object that can be used to access this object later with getState().
const PKmerState getState (Ind ind) const
 Return pointer to a KmerState object given its index obtained earlier with indState().
const KmergetKmer (Ind ind) const
 Return reference to a Kmer object given its index obtained earlier with indState().
KmerId getFirstIdState () const
 Return first state ID (for non-degenerate k-mers).
KmerId getLastIdState () const
 Return last used plus one state ID (for non-degenerate k-mers).
int getNumIds () const
 Get the total number of different IDs ("number of features") - depends on the RC_POLICY.
std::ostream & print (std::ostream &out) const
 Print KmerStates object for debugging.
Ind kmerToIndex (const Kmer &kmer) const
 Return position of Kmer object in internal array.

Protected Member Functions

PKmerState kmerToState (const Kmer &kmer) const
 Return pointer to KmerState for a given Kmer object.
void nextKmer (const Kmer &currKmer, INuc c, Kmer &nextKmer) const
 Fill next Kmer object based on current Kmer and incoming nucleotide code.
void initAllKmers ()
 Initialize all KmerState and Kmer objects.
void initRevCompl ()
 Initialize reverse-complement links in KmerState objects.

Protected Attributes

int iFirstStateNonDegen
 Index of the first non-degenerate state in m_states.
KmerId m_firstIdState
 IDs of output (non-degenerate, non-reverse-complement) k-mers fill a half open interval [m_firstIdState,m_lastIdState).

Detailed Description

State machine for k-mers along with a corresponding payload object implemented in KmerStates and KmerCounter classes respectively.

The current implementation is optimized for the following use case:

  1. relatively short k-mers (k < 10)
  2. k-mer counts will be computed for many short chunks of sequence, which assumes the following set of operations
    • reset counts
    • stream through sequence chunk
    • report the counts
    • repeat
    Typical length of sequence chunk (~1000 nuc) is less than the number of possible k-mers (~16,000 for k = 7).
  3. k-mer counts will be extracted only for non-redundant subset of reverse complement k-mers (~8,000 for k = 7)

In view of 2, we will keep the "payload" for each state node separately, referenced by the pointer to an instance of KmerStateData class. The payload object will contain the back reference to its state and the count, and the payloads can be allocated as needed, from a pre-allocated array. The size of payload array is bound by the max chunk length x 2 (direct plus reverse-complement sequence lengths), or the total number of possible k-mers for a given k, whichever is less. Therefore, we pre-allocate storage for the payload array of with the total number of possible k-mers.

The current implementation uses dense k-mer state machine (all possible k-mers are preallocated and states are initialized) with sparse payload: all states have initially their payloads set to NULL. As we stream through the input sequence, we move from the current state to the next one. At each step, we take the first of the reverse complement pair of states, and check if its payload is NULL. If it is, we allocate the new payload object. Then, we increment the count in the payload object.

Any implementation of KmerStates must guarantee this: All degenerate states point to the very first (zero) state as their reverse-complement state. That allows KmerCounter to link zero state to a single dummy KmerStateData object. Thus, reverse-complement link is not symmetric for degenerate states.

When counts are to be extracted, we have to loop only through the payload objects for those k-mers that were actually seen and represent the non-redundant subset of all reverse-complement pairs.

After construction, this implementation guarantees the following complexity during the sequence processing: O(1) on k-mer length; O(min(L,N)) where L is the actual sequence chunk length and N is the number of possible k-mers + O(min(L,N)*log(min(L,N))) if results are sorted by k-mer ID for extraction.


Constructor & Destructor Documentation

MGT::KmerStates::KmerStates ( int  kmerLen,
const AbcConvCharToInt pAbcConv,
RC_POLICY  revCompPolicy = RC_MERGE,
KmerId  firstIdState = 1 
)

Constructor.

Parameters:
kmerLen- length of k-mers
pAbcConv- pointer to AbcConvCharToInt alphabet convertor object (stored inside this KmerStates object but not managed).
firstIdState- start k-mer ids from this value.

Member Function Documentation

PKmerState MGT::KmerStates::firstState (  ) [inline]

Return pointer to the very first KmerState in the state machine.

Each k-mer count accumulation cycle starts from this state.

const PKmerState MGT::KmerStates::firstState (  ) const [inline]

Return constant pointer to the very first KmerState in the state machine.

Each k-mer count accumulation cycle starts from this state.

PKmerStateData MGT::KmerStates::getData ( PKmerState  pState ) [inline]

Return pointer to KmerStateData object linked to the given state.

Parameters:
pState- current state
PKmerStateData MGT::KmerStates::getData ( Ind  ind ) [inline]

Return pointer to KmerStateData object linked to the state with a given index.

Parameters:
ind- index of the state
KmerId MGT::KmerStates::getFirstIdState (  ) const [inline]

Return first state ID (for non-degenerate k-mers).

const Kmer & MGT::KmerStates::getKmer ( Ind  ind ) const [inline]

Return reference to a Kmer object given its index obtained earlier with indState().

Parameters:
indIndex of the state This is done in constant time by indexing the internal array.
KmerId MGT::KmerStates::getLastIdState (  ) const [inline]

Return last used plus one state ID (for non-degenerate k-mers).

int MGT::KmerStates::getNumIds (  ) const [inline]

Get the total number of different IDs ("number of features") - depends on the RC_POLICY.

const PKmerState MGT::KmerStates::getState ( Ind  ind ) const [inline]

Return pointer to a KmerState object given its index obtained earlier with indState().

Parameters:
indIndex of the state This is done in constant time by indexing the internal array.
KmerId MGT::KmerStates::idState ( PKmerState  pState ) const [inline]

Return unique ID of the state's k-mer.

This is stable between invocations of the program. IDs fill the half-open interval [getFirstIdState(),getLastIdState()) (for the first non-degenerate k-mer).

Ind MGT::KmerStates::indState ( const PKmerState  pState ) const [inline]

Return index of a given KmerState object that can be used to access this object later with getState().

Parameters:
pState- pointer to state This is done in constant time based on position of pState in internal array.
void MGT::KmerStates::initAllKmers (  ) [protected]

Initialize all KmerState and Kmer objects.

void MGT::KmerStates::initRevCompl (  ) [protected]

Initialize reverse-complement links in KmerState objects.

bool MGT::KmerStates::isDegenState ( PKmerState  pState ) const [inline]

Return true if a given state corresponds to a degenerate k-mer.

bool MGT::KmerStates::isRevComp ( PKmerState  pState ) const [inline]

Return true if this state is marked as reverse-complement.

This method guarantees that only one of the two mutually reverse-complement states is marked as such, and that it is always the same one between different invocations of the program.

const Kmer & MGT::KmerStates::kmerState ( PKmerState  pState ) const [inline]

Return reference to Kmer object for a given KmerState object.

Parameters:
pState- pointer to state This is done in constant time based on position of pState in internal array.
Ind MGT::KmerStates::kmerToIndex ( const Kmer kmer ) const [inline]

Return position of Kmer object in internal array.

Parameters:
kmer- Kmer object This is done based on value of kmer, and complexity is linear in kmer length. This is made public for the use of implementation-aware code such as KmerCounterLadder.
PKmerState MGT::KmerStates::kmerToState ( const Kmer kmer ) const [inline, protected]

Return pointer to KmerState for a given Kmer object.

Parameters:
kmer- Kmer object. This is done based on value of kmer, and complexity is linear in kmer length.
void MGT::KmerStates::nextKmer ( const Kmer currKmer,
INuc  c,
Kmer nextKmer 
) const [inline, protected]

Fill next Kmer object based on current Kmer and incoming nucleotide code.

If c is a degenerate code, always output the first (zero) k-mer.

Parameters:
currKmer- current Kmer
c- nucleotide code
nextKmer- output Kmer
PKmerState MGT::KmerStates::nextState ( PKmerState  pState,
INuc  c 
) [inline]

Return pointer to the next KmerState in the state machine.

Parameters:
pState- current state
c- nucleotide integer code
int MGT::KmerStates::numStates (  ) const [inline]

Return total number of KmerState states in the state machine (normal and degenerate).

std::ostream & MGT::KmerStates::print ( std::ostream &  out ) const

Print KmerStates object for debugging.

PKmerState MGT::KmerStates::revCompState ( PKmerState  pState ) [inline]

Return pointer to the reverse complement KmerState.

Parameters:
pState- current state
PKmerState MGT::KmerStates::revCompStateFirst ( PKmerState  pState ) [inline]

Return the state that we marked as "first" in a pair of mutually reverse-complement states.

Parameters:
pStatepoints to any one of the two mutually rev-complement states.

Todo:
we can avoid the branch statement by storing a pointer to the "first" rev-comp state inside each KmerState object.

void MGT::KmerStates::setData ( PKmerState  pState,
PKmerStateData  pStateData 
) [inline]

Link given KmerState object to the given KmerStateData object.

This sets only uni-directed link from KmerState to KmerStateData.

Parameters:
pState- pointer to state
pStateData- pointer to data

Member Data Documentation

Index of the first non-degenerate state in m_states.

KmerId MGT::KmerStates::m_firstIdState [protected]

IDs of output (non-degenerate, non-reverse-complement) k-mers fill a half open interval [m_firstIdState,m_lastIdState).


The documentation for this class was generated from the following files: