
tokenizer Class Reference

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.

#include <tokenizer.h>


Public Member Functions

 tokenizer (const string &)
 Constructor.
list< word > tokenize (const string &)
 tokenize string with default options
list< word > tokenize (const string &, int &)
 tokenize string with default options, tracking offset

Private Attributes

set< string > abrevs
 abbreviations set (Dr., Mrs., etc.; the trailing period is not separated)
vector< pair< string, RegEx > > rules
 tokenization rules
map< string, int > matches
 substrings to convert into tokens in each rule


Detailed Description

Class tokenizer implements a token splitter, which converts a string into a sequence of word objects, according to a set of tokenization rules read from a configuration file.
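
The rule-driven splitting described above can be sketched as follows. This is a minimal illustration, not FreeLing's implementation: std::regex stands in for the RegEx type, tokens are plain strings rather than word objects, and the names split_with_rules, rule_list and example_rules, as well as the three rules themselves, are hypothetical. The configuration file format is not shown in this reference, so the rules are built in code.

```cpp
#include <list>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the rule table: (rule name, pattern) pairs,
// tried in order. std::regex replaces the RegEx type named in the reference.
using rule_list = std::vector<std::pair<std::string, std::regex>>;

// Split `text` into token strings. At each position, skip whitespace, then
// take the first rule that matches starting exactly at that position.
std::list<std::string> split_with_rules(const std::string &text,
                                        const rule_list &rules) {
    std::list<std::string> tokens;
    auto pos = text.cbegin();
    while (pos != text.cend()) {
        if (*pos == ' ' || *pos == '\t' || *pos == '\n') { ++pos; continue; }
        bool matched = false;
        for (const auto &rule : rules) {
            std::smatch m;
            // match_continuous anchors the match at `pos`.
            if (std::regex_search(pos, text.cend(), m, rule.second,
                                  std::regex_constants::match_continuous)
                && m.length(0) > 0) {
                tokens.push_back(m.str(0));
                pos += m.length(0);
                matched = true;
                break;
            }
        }
        // No rule applies: emit the single character so the loop always advances.
        if (!matched) { tokens.push_back(std::string(1, *pos)); ++pos; }
    }
    return tokens;
}

// Illustrative rules only; not FreeLing's shipped configuration.
rule_list example_rules() {
    return {
        {"NUMBER", std::regex("[0-9]+([.,][0-9]+)*")},
        {"WORD",   std::regex("[A-Za-z]+")},
        {"OTHER",  std::regex("[^ \\t\\n]")},
    };
}
```

Rule order matters in this sketch: placing NUMBER before OTHER keeps a decimal such as 3.14 together instead of splitting it at the period.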


Constructor & Destructor Documentation

tokenizer::tokenizer (const string &)
 

Constructor.


Member Function Documentation

list< word > tokenizer::tokenize (const string &, int &)
 

tokenize string with default options, tracking offset
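
The reference says only that this overload tracks an offset through the int & parameter; the exact semantics are not documented here. One plausible reading, sketched below under that assumption, is that token positions are numbered starting from the incoming offset and the offset is advanced by the length of the input, so that consecutive calls share one coordinate system. The span_token struct and tokenize_with_offset function are hypothetical, and a plain whitespace split replaces the rule-based splitting to keep the example short.

```cpp
#include <cctype>
#include <cstddef>
#include <list>
#include <string>

// Hypothetical token record; the real `word` class is not described here.
struct span_token {
    std::string form;
    int begin;  // offset of the first character
    int end;    // offset one past the last character
};

// Whitespace splitter that numbers positions starting from `offset` and
// advances `offset` by the length of `text` before returning.
std::list<span_token> tokenize_with_offset(const std::string &text, int &offset) {
    std::list<span_token> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        if (std::isspace(static_cast<unsigned char>(text[i]))) { ++i; continue; }
        std::size_t start = i;
        while (i < text.size() &&
               !std::isspace(static_cast<unsigned char>(text[i]))) {
            ++i;
        }
        tokens.push_back({text.substr(start, i - start),
                          offset + static_cast<int>(start),
                          offset + static_cast<int>(i)});
    }
    offset += static_cast<int>(text.size());
    return tokens;
}
```

Calling the function twice with the same offset variable, first on "Hello world " and then on "again", places the token "again" at position 12 in this sketch, directly after the first chunk.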

list< word > tokenizer::tokenize (const string &)
 

tokenize string with default options


Member Data Documentation

set<string> tokenizer::abrevs [private]
 

abbreviations set (Dr., Mrs., etc.; the trailing period is not separated)
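
The effect of the abbreviation set can be illustrated with a small sketch. It assumes the set stores each abbreviation with its period and that lookup is case-sensitive; neither detail is stated in this reference, and split_final_period is a hypothetical helper, not a member of the class.

```cpp
#include <list>
#include <set>
#include <string>

// Split a trailing period off `chunk` unless the chunk is a known abbreviation.
std::list<std::string> split_final_period(
        const std::string &chunk,
        const std::set<std::string> &abbreviations) {
    std::list<std::string> out;
    bool ends_with_period = chunk.size() > 1 && chunk.back() == '.';
    if (ends_with_period && abbreviations.count(chunk) == 0) {
        // Ordinary word at the end of a sentence: the period becomes its own token.
        out.push_back(chunk.substr(0, chunk.size() - 1));
        out.push_back(".");
    } else {
        // Abbreviation (or no final period): keep the chunk as one token.
        out.push_back(chunk);
    }
    return out;
}
```

With a set containing "Dr." and "Mrs.", the chunk "Dr." stays a single token, while "house." yields "house" followed by ".".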

map<string,int> tokenizer::matches [private]
 

substrings to convert into tokens in each rule

vector<pair<string,RegEx> > tokenizer::rules [private]
 

tokenization rules


The documentation for this class was generated from the following files:
Generated on Wed Apr 26 12:59:15 2006 for FreeLing by  doxygen 1.4.4