

Named entity recognition data file

This file controls the behaviour of the simple NE recognizer. It consists of four sections:

Section <FunctionWords> lists the function words that can be embedded inside a proper noun (e.g. prepositions and articles such as those in "Banco de España" or "Foundation for the Eradication of Poverty"). For instance:

<FunctionWords>
el
la
los
las
de
del
para
</FunctionWords>
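As a rough illustration of how such a list might be used, the following Python sketch groups capitalized tokens into candidate proper nouns, allowing listed function words in between. It is not the recognizer's actual implementation: the function name `find_candidate_spans`, the rule that a span must start and end with a capitalized token, and the decision to ignore sentence-initial capitalization are all assumptions made for this example.

```python
def find_candidate_spans(tokens, function_words):
    """Return (start, end) index pairs (end exclusive) of candidate proper nouns.

    Assumption for this sketch: a candidate starts and ends with a capitalized
    token, and may contain function words only between capitalized tokens.
    """
    spans = []
    i = 0
    n = len(tokens)
    while i < n:
        if not tokens[i][:1].isupper():
            i += 1
            continue
        start = i
        end = i + 1  # the span always ends right after a capitalized token
        j = i + 1
        while j < n:
            if tokens[j][:1].isupper():
                end = j + 1
                j += 1
            elif tokens[j].lower() in function_words:
                j += 1  # kept only if a capitalized token follows later
            else:
                break
        spans.append((start, end))
        i = end
    return spans


function_words = {"for", "the", "of"}
tokens = "the Foundation for the Eradication of Poverty was founded".split()
spans = find_candidate_spans(tokens, function_words)
names = [" ".join(tokens[s:e]) for s, e in spans]
```

With the sample sentence above, `names` holds the single candidate "Foundation for the Eradication of Poverty". A trailing function word is never absorbed into the span.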

Section <SpecialPunct> lists the PoS tags (as defined in the punctuation tags definition file, section 2.13) after which a capitalized word may simply mark the beginning of a sentence or clause and is not necessarily a named entity. Typical cases are colon, open parenthesis, dot, and hyphen. For instance:

<SpecialPunct>
Fpa
Fp
Fd
Fg
</SpecialPunct>

Section <NE_Tag> contains only one line with the PoS tag that will be assigned to the recognized entities. If the NE classifier is going to be used later, it will have to be informed of this tag at creation time.

<NE_Tag>
NP00000
</NE_Tag>

Section <TitleLimit> contains only one line with an integer value stating the length beyond which a sentence written entirely in uppercase will be considered a title and not a proper noun. Example:

<TitleLimit>
3
</TitleLimit>

If TitleLimit=0 (the default), title detection is deactivated (i.e., all-uppercase sentences are always marked as named entities).

The idea behind this heuristic is that newspaper titles are usually written in uppercase and tend to have at least two or three words, while named entities written in uppercase tend to be acronyms (e.g. IBM, DARPA, ...) and usually have at most one or two words.

For instance, if TitleLimit=3, the sentence FREELING ENTERS NASDAQ UNDER CLOSE INTEREST OF MARKET ANALYSTS will not be recognized as a named entity, and its words will be analyzed independently. On the other hand, the sentence IBM INC., having fewer than 3 words, will be considered a proper noun.
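The decision rule can be sketched in a few lines of Python. This is only an illustration of the heuristic as described here, not the recognizer's code: the function name `is_title` is hypothetical, and treating a sentence with exactly TitleLimit words as a title is an assumption, since the text only states that a sentence with fewer words is kept as a proper noun.

```python
def is_title(sentence, title_limit):
    """Return True if an all-uppercase sentence should be treated as a title.

    Assumption: sentences with at least `title_limit` words count as titles;
    the behaviour at exactly `title_limit` words is a guess.
    """
    if title_limit <= 0:
        return False  # TitleLimit=0 deactivates title detection
    words = sentence.split()
    # str.isupper() is True when every cased character is uppercase
    # and the string contains at least one cased character.
    return sentence.isupper() and len(words) >= title_limit


headline = "FREELING ENTERS NASDAQ UNDER CLOSE INTEREST OF MARKET ANALYSTS"
headline_is_title = is_title(headline, 3)   # long uppercase sentence
acronym_is_title = is_title("IBM INC.", 3)  # only two words
```

When the function returns False for an all-uppercase sentence, the sentence would be marked as a named entity, as in the default behaviour.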

This heuristic is obviously not 100% accurate, but in some cases (e.g. when analyzing newspapers) it may be preferable to the default behaviour.
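For readers who want to inspect or validate such a file outside the library, the four sections described above could be read with a small Python sketch like the one below. The helper names `parse_sections` and `load_ner_config` are hypothetical and are not part of the library; the sketch assumes one entry per line, no nested sections, and no comment syntax. The default of 0 for TitleLimit follows the text, while returning None for a missing NE_Tag is an assumption.

```python
import re


def parse_sections(text):
    """Split a sectioned data file into {section_name: [non-empty lines]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        closing = re.fullmatch(r"</(\w+)>", line)
        opening = re.fullmatch(r"<(\w+)>", line)
        if closing:
            if closing.group(1) != current:
                raise ValueError(f"unexpected closing tag: {line}")
            current = None
        elif opening:
            if current is not None:
                raise ValueError(f"section opened inside <{current}>: {line}")
            current = opening.group(1)
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def load_ner_config(text):
    """Turn the parsed sections into a plain dictionary of settings."""
    s = parse_sections(text)
    return {
        "function_words": set(s.get("FunctionWords", [])),
        "special_punct": set(s.get("SpecialPunct", [])),
        # Assumption: None when the section is missing.
        "ne_tag": s["NE_Tag"][0] if s.get("NE_Tag") else None,
        # TitleLimit defaults to 0, i.e. title detection deactivated.
        "title_limit": int(s["TitleLimit"][0]) if s.get("TitleLimit") else 0,
    }


SAMPLE = """<FunctionWords>
de
del
</FunctionWords>
<SpecialPunct>
Fpa
Fp
</SpecialPunct>
<NE_Tag>
NP00000
</NE_Tag>
<TitleLimit>
3
</TitleLimit>
"""

config = load_ner_config(SAMPLE)
```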


2006-04-26