The first module in the processing chain is the tokenizer. As
described in section 2.2.1, the behaviour of the
tokenizer is controlled via the TokenizerFile option in the
configuration file.
To create a tokenizer for a new language, create a new
tokenization rules file (e.g. by copying an existing one and adapting
its regexps to the particularities of your language), and set
it as the value of the TokenizerFile option in your new
configuration file.
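
The following is a minimal sketch of the general idea behind regexp-driven tokenization, intended only to illustrate what "adapting the regexps" means in practice. It is not the library's implementation and does not use the actual rules file syntax: the rule names, the patterns, and the `tokenize` function are all hypothetical. It assumes rules are tried in order at the current position and that the first matching rule produces the token.

```python
import re

# Hypothetical ordered rules: (rule name, regular expression).
# Earlier rules take priority, so language-specific patterns go first.
RULES = [
    ("ABBREVIATION", r"(?:Dr|Mr|Mrs|etc)\."),   # adapt this list per language
    ("NUMBER", r"\d+(?:[.,]\d+)*"),             # digits with . or , separators
    ("WORD", r"\w+(?:['-]\w+)*"),               # words, allowing inner ' and -
    ("OTHER", r"\S"),                           # fallback: any single symbol
]

COMPILED = [(name, re.compile(pattern)) for name, pattern in RULES]


def tokenize(text):
    """Split text into (rule name, token) pairs using the first matching rule."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Whitespace separates tokens and is never emitted.
        if text[pos].isspace():
            pos += 1
            continue
        # Try each rule at the current position, in priority order.
        for name, regex in COMPILED:
            match = regex.match(text, pos)
            if match:
                tokens.append((name, match.group()))
                pos = match.end()
                break
    return tokens


if __name__ == "__main__":
    for name, token in tokenize("Dr. Smith paid 3,50 euros."):
        print(name, token)
```

In this sketch, supporting a new language would amount to editing the patterns in `RULES`, for instance replacing the abbreviation list or changing which characters may appear inside a word. The real rules file plays the analogous role for the tokenizer selected through TokenizerFile.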
2006-04-26