MGT::KmerCounter Class Reference

Class that computes k-mer counts of incoming nucleotide sequence.

#include <kmers.hpp>

Inheritance diagram for MGT::KmerCounter:
MGT::Kmers::KmerSparseFeatures

Public Member Functions

 KmerCounter (int kmerLen, const AbcConvCharToInt *pAbcConv=0, RC_POLICY revCompPolicy=RC_MERGE, KmerId firstIdState=1)
 A constructor.
void doCNuc (CNuc cnuc)
 This method is called to process the input sequence.
void doINuc (INuc inuc)
 This method is called for each element of an input sequence (through KmerCounter::doCNuc() wrapper method).
int maxNumKmers (ULong seqLen) const
 Return the maximum number of unique k-mers that can be found in a sequence of a given length.
KmerId getFirstIdState () const
 Return first state ID (for non-degenerate k-mers).
KmerId getLastIdState () const
 Return one past the last used state ID (for non-degenerate k-mers).
int getNumIds () const
 Get the total number of different IDs ("number of features") - depends on the RC_POLICY.
KmerStates & getStates ()
 Get reference to internal KmerStates array.

Interface to extract the results.

The following set of methods defines the result extraction protocol.

They must be called in a strict order because they change the internal state of the KmerCounter object. The interface is designed to be efficient and flexible for the caller. After doCNuc() has been called any number of times (we call this an "accumulation cycle"), the extraction is done as in the following code sample, after which doCNuc() can be called again to accumulate new counts.

 int n = o.numKmers();
 o.startKmer();
 for(int i = 0; i < n; i++,o.nextKmer()) {
     cout << o.getKmerId() << ":" << o.getKmerCount() << "\n";
 }
 o.finishKmer();

nextKmer() can be called fewer than numKmers() times. sumKmerCounts() can only be called outside of a startKmer()...finishKmer() block.
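
The protocol above can be tried without the library by using a stand-in class. The sketch below is only an illustration: ToyKmerCounter and countAndExtract() are hypothetical names that are not part of MGT, the toy keys counts by the k-mer string instead of using the real state machine, it ignores the reverse-complement policy and k-mer IDs, and it assumes that finishKmer() resets the counts for the next accumulation cycle.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for MGT::KmerCounter: same method names, toy internals.
class ToyKmerCounter {
public:
    explicit ToyKmerCounter(int kmerLen) : m_k(kmerLen) { assert(kmerLen > 0); }

    // Accumulation: feed one nucleotide character at a time.
    void doCNuc(char cnuc) {
        m_window.push_back(cnuc);
        if (m_window.size() > static_cast<std::size_t>(m_k)) m_window.erase(0, 1);
        if (m_window.size() == static_cast<std::size_t>(m_k)) ++m_counts[m_window];
    }

    int numKmers() const { return static_cast<int>(m_counts.size()); }

    // Extraction: snapshot the results, then walk them with nextKmer().
    void startKmer() {
        m_results.assign(m_counts.begin(), m_counts.end());
        m_pos = 0;
    }
    void nextKmer() { ++m_pos; }
    const std::string& getKmerStr() const { return m_results[m_pos].first; }
    unsigned long getKmerCount() const { return m_results[m_pos].second; }

    // Assumption: finishing clears everything so a new cycle starts from zero.
    void finishKmer() {
        m_counts.clear();
        m_results.clear();
        m_window.clear();
        m_pos = 0;
    }

private:
    int m_k;
    std::string m_window;                           // last m_k characters seen
    std::map<std::string, unsigned long> m_counts;  // k-mer string -> count
    std::vector<std::pair<std::string, unsigned long>> m_results;
    std::size_t m_pos = 0;
};

// One full accumulation cycle followed by the extraction loop from the sample.
std::string countAndExtract(ToyKmerCounter& o, const std::string& seq) {
    for (char c : seq) o.doCNuc(c);
    std::ostringstream out;
    int n = o.numKmers();
    o.startKmer();
    for (int i = 0; i < n; i++, o.nextKmer()) {
        out << o.getKmerStr() << ":" << o.getKmerCount() << "\n";
    }
    o.finishKmer();
    return out.str();
}
```

Calling countAndExtract() twice on the same object models two consecutive accumulation cycles; in this toy the second call starts from empty counts.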

int numKmers () const
 Return number of k-mers found so far in current accumulation cycle.
ULong sumKmerCounts () const
 Return sum of non-degenerate k-mer counts found so far in current accumulation cycle.
ULong sumDegenKmerCounts () const
 Return sum of degenerate k-mer counts found so far in current accumulation cycle.
void startKmer (bool doSort=true)
 Prepare internal state for result extraction.
void nextKmer ()
 Advance internal state to extract next k-mer results.
ULong getKmerCount () const
 Accessor to get count value from the currently extracted k-mer.
int getKmerId () const
 Accessor to get Id from the currently extracted k-mer.
std::string getKmerStr () const
 Accessor to get k-mer string (such as 'ACCCT') for the currently extracted k-mer.
Ind indState () const
 Accessor to get index of the state for the currently extracted k-mer.
const PKmerState getState () const
 Accessor to get pointer to the state for the currently extracted k-mer.
void finishKmer ()
 Finalize result extraction cycle - new series of doCNuc() calls can be done afterwards.

Protected Attributes

RC_POLICY m_revCompPolicy
 How reverse complements are treated - one of RC_XXX.
std::vector< KmerStateData > m_data
 Preallocated array of KmerStateData objects.
KmerStateData m_dataDegen
 One dummy KmerStateData object that is linked to the zero state, which in turn is set as a reverse-complement one for all other degenerate states.
int m_iDataEnd
 Index of the first unused element in m_data.
int m_iDataExtr
 Index of m_data element that is currently being extracted.

Detailed Description

Class that computes k-mer counts of incoming nucleotide sequence.

It processes the input sequence and extracts the counts. Internally, it maintains the KmerStateData payload data and moves through states of the KmerStates state machine in response to incoming nucleotides.
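
One common way to move from k-mer to k-mer in response to each incoming nucleotide is a rolling 2-bit encoding, sketched below. This is not the KmerStates implementation, which is not shown in this reference; nucCode() and countByRollingState() are hypothetical names, the k <= 12 limit is an arbitrary choice to keep the table small, and, unlike the class documented here, the sketch uses an explicit branch to skip degenerate characters instead of counting them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Map a nucleotide character to a 2-bit code, or -1 for anything else.
int nucCode(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

// Count k-mers by treating each k-mer as a 2k-bit integer "state".
// The next state is derived from the current one and the incoming nucleotide.
std::vector<unsigned long> countByRollingState(const std::string& seq, int k) {
    assert(k > 0 && k <= 12);  // hypothetical limit: at most 4^12 table entries
    const std::uint32_t mask = (std::uint32_t{1} << (2 * k)) - 1;
    std::vector<unsigned long> counts(std::size_t{1} << (2 * k), 0);
    std::uint32_t state = 0;
    int filled = 0;  // number of valid nucleotides in the current window
    for (char c : seq) {
        const int code = nucCode(c);
        if (code < 0) {  // degenerate character: restart the window
            filled = 0;
            state = 0;
            continue;
        }
        // Shift in the new nucleotide and drop the oldest one.
        state = ((state << 2) | static_cast<std::uint32_t>(code)) & mask;
        if (filled < k) ++filled;
        if (filled == k) ++counts[state];
    }
    return counts;
}
```

With k = 2, the k-mer AC maps to index 1 and CA to index 4, so the returned vector can be read as a dense count table over all 16 possible 2-mers.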


Constructor & Destructor Documentation

MGT::KmerCounter::KmerCounter ( int  kmerLen,
const AbcConvCharToInt *  pAbcConv = 0,
RC_POLICY  revCompPolicy = RC_MERGE,
KmerId  firstIdState = 1 
)

A constructor.

Parameters:
kmerLen  is the length of a k-mer. In the current implementation, all k-mers are precalculated and stored in memory, and their number grows exponentially with this parameter, so keep it small.
pAbcConv  is a pointer to an AbcConvCharToInt alphabet converter (stored inside this KmerCounter object but not managed by it).
revCompPolicy  what to do about reverse-complement k-mers
firstIdState  start k-mer IDs from this value (default 1, as in SVMLight)
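
To get a feeling for how quickly the precalculated table grows with kmerLen, the number of non-degenerate k-mers over a four-letter alphabet can be computed directly. The helper below is a hypothetical illustration, not a library function, and it says nothing about the per-state memory cost or about degenerate states, neither of which is specified here.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct k-mers over the alphabet {A, C, G, T}: 4^kmerLen.
std::uint64_t numNonDegenerateKmers(int kmerLen) {
    assert(kmerLen >= 0 && kmerLen < 32);  // keep the shift within 64 bits
    return std::uint64_t{1} << (2 * kmerLen);
}
```

For example, kmerLen = 8 gives 65536 k-mers, while kmerLen = 12 already gives 16777216.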

Member Function Documentation

void MGT::KmerCounter::doCNuc ( CNuc  cnuc ) [inline]

This method is called to process the input sequence.

Series of calls to this method are interleaved with calls to result extraction methods.

Parameters:
cnuc  one nucleotide character value (such as 'A')

void MGT::KmerCounter::finishKmer (  ) [inline]

Finalize result extraction cycle - new series of doCNuc() calls can be done afterwards.

KmerId MGT::KmerCounter::getFirstIdState (  ) const [inline]

Return first state ID (for non-degenerate k-mers).

ULong MGT::KmerCounter::getKmerCount (  ) const [inline]

Accessor to get count value from the currently extracted k-mer.

int MGT::KmerCounter::getKmerId (  ) const [inline]

Accessor to get Id from the currently extracted k-mer.

std::string MGT::KmerCounter::getKmerStr (  ) const [inline]

Accessor to get k-mer string (such as 'ACCCT') for the currently extracted k-mer.

KmerId MGT::KmerCounter::getLastIdState (  ) const [inline]

Return one past the last used state ID (for non-degenerate k-mers).

int MGT::KmerCounter::getNumIds (  ) const [inline]

Get the total number of different IDs ("number of features") - depends on the RC_POLICY.

const PKmerState MGT::KmerCounter::getState (  ) const [inline]

Accessor to get pointer to the state for the currently extracted k-mer.

Should be used only by implementation-aware code such as KmerCounterLadder.

KmerStates& MGT::KmerCounter::getStates (  ) [inline]

Get reference to internal KmerStates array.

Declared public only to be used by implementation-aware code such as KmerCounterLadder.

Ind MGT::KmerCounter::indState (  ) const [inline]

Accessor to get index of the state for the currently extracted k-mer.

Should be used only by implementation-aware code such as KmerCounterLadder.

int MGT::KmerCounter::maxNumKmers ( ULong  seqLen ) const [inline]

Return the maximum number of unique k-mers that can be found in a sequence of a given length.

Parameters:
seqLen  sequence length

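
A simple upper bound of this kind follows from two facts: a sequence of length L contains at most L - k + 1 windows of length k, and there are only 4^k distinct k-mers over a four-letter alphabet. The function below is a sketch of that bound under those assumptions; maxUniqueKmersBound() is a hypothetical name, and the value returned by maxNumKmers() may differ, for example if it accounts for the reverse-complement policy.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on the number of distinct k-mers in a sequence of length seqLen
// over a 4-letter alphabet: min(number of windows, 4^k).
std::uint64_t maxUniqueKmersBound(std::uint64_t seqLen, int k) {
    assert(k > 0 && k < 32);  // keep 4^k within 64 bits
    const std::uint64_t kk = static_cast<std::uint64_t>(k);
    if (seqLen < kk) return 0;  // sequence too short to hold even one k-mer
    const std::uint64_t windows = seqLen - kk + 1;
    const std::uint64_t alphabetLimit = std::uint64_t{1} << (2 * k);  // 4^k
    return windows < alphabetLimit ? windows : alphabetLimit;
}
```

For short sequences the window count is the limiting term; for long sequences and small k the 4^k term takes over.
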
void MGT::KmerCounter::nextKmer (  ) [inline]

Advance internal state to extract next k-mer results.

int MGT::KmerCounter::numKmers (  ) const [inline]

Return number of k-mers found so far in current accumulation cycle.

void MGT::KmerCounter::startKmer ( bool  doSort = true ) [inline]

Prepare internal state for result extraction.

Parameters:
doSort  if true, the results will be sorted (complexity will be N*log(N), where N is numKmers()). An SVM sparse feature vector representation needs sorted results.

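
The reason sorting matters can be seen by building a sparse feature line by hand: SVMLight-style input lists id:value pairs in increasing order of feature ID. The sketch below sorts unsorted (ID, count) pairs and formats them; Feature and toSparseLine() are hypothetical names, and the label field that normally starts such a line is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Feature = std::pair<int, unsigned long>;  // (k-mer ID, count)

// Sort features by ID and format them as "id:count id:count ...".
std::string toSparseLine(std::vector<Feature> features) {
    std::sort(features.begin(), features.end(),
              [](const Feature& a, const Feature& b) { return a.first < b.first; });
    std::ostringstream out;
    for (std::size_t i = 0; i < features.size(); ++i) {
        if (i > 0) out << ' ';
        out << features[i].first << ':' << features[i].second;
    }
    return out.str();
}
```

If the consumer does not need ordered IDs, skipping the sort avoids the N*log(N) step, which is the trade-off the doSort flag exposes.
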
ULong MGT::KmerCounter::sumDegenKmerCounts (  ) const [inline]

Return sum of degenerate k-mer counts found so far in current accumulation cycle.

Can only be called outside of a startKmer()...finishKmer() block. Complexity: constant time.

ULong MGT::KmerCounter::sumKmerCounts (  ) const [inline]

Return sum of non-degenerate k-mer counts found so far in current accumulation cycle.

Can only be called outside of a startKmer()...finishKmer() block. Complexity: linear in the number of found k-mers.


Member Data Documentation

std::vector<KmerStateData> MGT::KmerCounter::m_data [protected]

Preallocated array of KmerStateData objects.

Array size is equal to the total number of states. This guarantees that there are always enough data elements regardless of the input sequence length.

Todo:
Current array size is more than 50% excessive because we only store counts for one of every two reverse-complement states, and all degenerate states have zero state set as their reverse complement. This is a low priority optimization.

KmerStateData MGT::KmerCounter::m_dataDegen [protected]

One dummy KmerStateData object that is linked to the zero state, which in turn is set as a reverse-complement one for all other degenerate states.

This way, it serves as a sink counter for all degenerate states. That in turn removes one branch condition from the time-critical code in doINuc(). We also query from it the total count of degenerate states in the last accumulation cycle.
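
The sink-counter idea can be illustrated on a much smaller scale than the real k-mer state table. In the sketch below every input byte is mapped through a lookup table to a counter slot, and all non-ACGT bytes share one extra slot, so the hot loop increments a counter without testing whether the input is degenerate. SinkCounts and countBases() are hypothetical names, and counting single nucleotides instead of k-mer states is a simplification.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct SinkCounts {
    std::vector<unsigned long> perSlot;  // slots 0..3 = A, C, G, T; slot 4 = sink
    unsigned long degenerate() const { return perSlot[4]; }
};

SinkCounts countBases(const std::string& seq) {
    // Lookup table: every byte goes to the sink slot unless it is A, C, G or T.
    unsigned char slot[256];
    for (int i = 0; i < 256; ++i) slot[i] = 4;
    slot['A'] = 0;
    slot['C'] = 1;
    slot['G'] = 2;
    slot['T'] = 3;

    SinkCounts r{std::vector<unsigned long>(5, 0)};
    // Hot loop: one table lookup and one increment, no "is degenerate?" branch.
    for (unsigned char c : seq) ++r.perSlot[slot[c]];
    return r;
}
```

Reading the sink slot afterwards gives the total number of degenerate inputs, which mirrors how the total degenerate count is queried from the dummy object.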

int MGT::KmerCounter::m_iDataEnd [protected]

Index of the first unused element in m_data.

int MGT::KmerCounter::m_iDataExtr [protected]

Index of the m_data element that is currently being extracted.

RC_POLICY MGT::KmerCounter::m_revCompPolicy [protected]

How reverse complements are treated - one of RC_XXX.
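
When reverse complements are merged, a k-mer and its reverse complement are counted as the same feature. A common way to do this is to pick one canonical representative of the pair, sketched below. The function names are hypothetical, and choosing the lexicographically smaller string is an assumption; which representative this class actually keeps under RC_MERGE is not stated here.

```cpp
#include <algorithm>
#include <string>

// Complement of a single nucleotide; anything else maps to 'N'.
char complement(char c) {
    switch (c) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        default:  return 'N';
    }
}

// Reverse the k-mer and complement every nucleotide.
std::string reverseComplement(const std::string& kmer) {
    std::string rc(kmer.rbegin(), kmer.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complement);
    return rc;
}

// Assumed merge rule: keep the lexicographically smaller of the two strands.
std::string canonical(const std::string& kmer) {
    const std::string rc = reverseComplement(kmer);
    return std::min(kmer, rc);
}
```

Under this rule 'ACCCT' and its reverse complement 'AGGGT' both map to 'ACCCT', so their counts would land in a single feature.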


The documentation for this class was generated from the following files: