The distributed version includes morphological dictionaries for
the covered languages (English, Spanish, Catalan, Galician, and Italian):
- The Spanish and Catalan dictionaries are hand-built, and contain
the 6,000 most frequent open-category lemmas for each language, plus
all closed-category lemmas. The Spanish and Catalan dictionaries
try to maintain the same coverage (that is, the same lemmas are
expected to appear in both dictionaries).
The Spanish dictionary contains over 81,000 forms corresponding
to more than 7,100 different lemma-PoS combinations, and the Catalan
dictionary contains nearly 67,000 forms corresponding to more than
7,400 different lemma-PoS combinations.
- The Galician dictionary was obtained from the OpenTrad project (an
open source Machine Translation project at www.opentrad.org), and contains over 90,000
forms corresponding to nearly 7,500 lemma-PoS combinations.
These data are distributed under their original Creative Commons
license; see the THANKS and COPYING files for further information.
- The English dictionary was automatically extracted from the WSJ corpus,
with minimal manual post-editing, and thus may be somewhat noisy.
It contains over 160,000 forms corresponding to some 102,000
different lemma-PoS combinations.
- The Italian dictionary is extracted from the Morph-it! lexicon
developed at the University of Bologna, and contains over 360,000
forms corresponding to more than 40,000 lemma-PoS combinations.
These data are distributed under their original Creative Commons
license; see the THANKS and COPYING files for further information.
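To make the form and lemma-PoS counts above concrete, here is a minimal sketch. It assumes a hypothetical in-memory dictionary that maps each word form to its list of (lemma, PoS) analyses. The toy entries, the tag names, and the function names are illustrative only and do not reflect FreeLing's data files or API.

```python
# Hypothetical toy dictionary: word form -> list of (lemma, PoS) analyses.
# Entries and tag names are made up for illustration, not FreeLing data.
toy_dictionary = {
    "house": [("house", "NOUN"), ("house", "VERB")],
    "houses": [("house", "NOUN"), ("house", "VERB")],
    "housed": [("house", "VERB")],
    "quickly": [("quickly", "ADV")],
}


def count_forms(dictionary):
    """Number of distinct word forms listed in the dictionary."""
    return len(dictionary)


def count_lemma_pos(dictionary):
    """Number of distinct lemma-PoS combinations across all forms."""
    return len({
        analysis
        for analyses in dictionary.values()
        for analysis in analyses
    })


print(count_forms(toy_dictionary), "forms")
print(count_lemma_pos(toy_dictionary), "lemma-PoS combinations")
```

Several forms can share one lemma-PoS combination, which is why the form counts reported above are much larger than the combination counts.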
The smaller dictionaries (Spanish, Catalan and Galician) are expected to cover
over 80% of open-category tokens in a text. For words not found
in the dictionary, all open categories are assumed, with a
probability distribution based on word suffixes. This distribution includes
the right tag for 99% of the words, and allows the tagger
to make the most suitable choice based on tag sequence probability.
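The fallback for unknown words can be sketched as follows. This is a minimal illustration, not FreeLing's implementation: the suffix table, its probabilities, the tag names, the longest-suffix-first lookup, and the uniform back-off are all assumptions made for the example.

```python
# Open categories assumed for any unknown word (illustrative tag names).
OPEN_TAGS = ("NOUN", "VERB", "ADJ", "ADV")

# Hypothetical suffix table: suffix -> P(tag | suffix). Values are made up.
SUFFIX_PROBS = {
    "ly": {"ADV": 0.80, "ADJ": 0.15, "NOUN": 0.04, "VERB": 0.01},
    "ing": {"VERB": 0.70, "NOUN": 0.20, "ADJ": 0.09, "ADV": 0.01},
    "s": {"NOUN": 0.60, "VERB": 0.35, "ADJ": 0.04, "ADV": 0.01},
}


def guess_tags(word, max_suffix_len=4):
    """Return a tag -> probability map for a word missing from the dictionary.

    Tries the longest matching suffix first. If no suffix matches, backs
    off to a uniform distribution over all open categories.
    """
    lowered = word.lower()
    for length in range(min(max_suffix_len, len(lowered)), 0, -1):
        suffix = lowered[-length:]
        if suffix in SUFFIX_PROBS:
            # Copy so callers cannot modify the shared table.
            return dict(SUFFIX_PROBS[suffix])
    uniform = 1.0 / len(OPEN_TAGS)
    return {tag: uniform for tag in OPEN_TAGS}


candidates = guess_tags("blorfly")
best_tag = max(candidates, key=candidates.get)
```

Every open category keeps a nonzero probability in this sketch, so a downstream tagger can still override the suffix-based preference when the tag sequence makes another category more likely.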
This version also includes WordNet-based sense dictionaries for the covered languages:
2006-04-26