Windows Interface for Tree Tagger
Ciarán Ó Duibhín
Latest version of tagger program interface: 09 March 2008
Latest version of training program interface: 09 March 2008
Warning: Don't use versions of the tagger program interface downloaded between 23 and 26 February 2007 — with the tokenization option "none", your input file could be deleted!
For the latest version of the interface, see here. The older version described here cannot process material in UTF8 encoding when the tokenization options "built-in" or "own + built-in" are used, but it remains available until the latest version is fully tested. The older version can still be run on UTF8 text with the tokenization option "none", provided two things are done in advance. First, normal text must be put into "one-token-per-line" format, eg. by running the Perl script "utf8-tokenize.perl" on it. Second, the UTF8 parameter file must be renamed to the name of the language followed by ".par", eg. english.par, german.par, french.par
The TreeTagger is a program developed by Helmut Schmid at the University of Stuttgart (now at the University of München), for part-of-speech tagging and lemmatization. Language parameters are supplied on the TreeTagger webpage for using the program with texts in English, French, German, Italian, Spanish, Russian, Bulgarian and Dutch, and parameters for some other languages are available from sites linked to the TreeTagger webpage. For a language for which no parameters exist, it is necessary to hand-tag some data, and then run a training program (provided with the TreeTagger) to create the parameters.
A zipped Windows distribution of the TreeTagger is available for download through a link near the end of the "download" section of the TreeTagger webpage. As supplied in that distribution, the programs have to be run from the MS-DOS command line, on which the required options are specified.
What is offered here is an add-on Windows graphic interface to the tagger program — and also a similar interface to the training program — which allows the options to be selected visually, and then the TreeTagger program to be launched, without the user having to switch from Windows to MS-DOS.
The selected set of options may be saved and re-loaded later, in the manner of a configuration file.
Below, a screenshot of the Windows interface to the tagger program.
Below, a screenshot of the Windows interface to the training program.
Latest enhancements include:
Download the Windows interface to the tagger program.
Download the Windows interface to the training program.
If you already have TreeTagger installed, all you need to do to add the graphic interface is place the two interface programs (wintreetagger.exe and wintraintreetagger.exe) in the directory that already contains tree-tagger.exe and train-tree-tagger.exe.
Do not make copies of the interface programs in other directories, as such copies will be unable to find the TreeTagger components. When you want to use either of the interface programs from another directory, make a shortcut to it in that directory.
If you are installing the Windows TreeTagger distribution from the beginning, I would suggest the following procedure. It prepares the TreeTagger for use both from the MS-DOS command line and from the Windows interface, but a number of steps (those marked below as needed only for use from the MS-DOS command line) may be omitted if use only from the Windows interface is intended.
1. From the TreeTagger website, download the Windows
TreeTagger distribution. Unzip it to C:\Program Files, with the "Use
folder names" box ticked in Winzip's "Extract" dialog — DON'T OMIT TO TICK THIS
BOX. (In Vista, right-click the zip file, choose "Extract all", and browse to
C:\Program Files.) This will create and populate the following directories:
C:\Program Files\TreeTagger
C:\Program Files\TreeTagger\lib
C:\Program Files\TreeTagger\bin
C:\Program Files\TreeTagger\cmd
Note: if you intend to use the TreeTagger only from the graphical interface, you will not need either of the two files which you will find in \cmd, and you will not need any of the .bat files in \bin. You will need the two .exe files in \bin, and all the other installed files.
2. Needed only if you intend to use
the TreeTagger from the MS-DOS command line:
Make the following changes in each of the .bat files in \bin:
• On line 3, replace
set TAGDIR=C:\TreeTagger
by
set TAGDIR=C:\Program Files\TreeTagger
• To handle the internal space in the new value of TAGDIR, we have to insert
quotes around various things in each of the .bat files (taking tag-english.bat
as an example):
Line 6: replace %LIB%\english.par by
"%LIB%\english.par"
Line 11: replace %CMD%\tokenize.pl by
"%CMD%\tokenize.pl"
and replace %LIB%\english-abbreviations by
"%LIB%\english-abbreviations"
and replace %BIN%\tree-tagger by
"%BIN%\tree-tagger"
Line 16: replace %CMD%\tokenize.pl by
"%CMD%\tokenize.pl"
and replace %LIB%\english-abbreviations by
"%LIB%\english-abbreviations"
and replace %BIN%\tree-tagger by
"%BIN%\tree-tagger"
With tag-spanish.bat, there will be two more such cases on line 17 and one on
line 18.
It may also be worthwhile to add a line
containing simply
pause
at the end of the batch file, to prevent the window showing the program's
progress from disappearing before you can read it.
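The quoting edits described in this step are repetitive, so they could in principle be applied mechanically. The following is a minimal Python sketch under stated assumptions, not part of the TreeTagger distribution: the function name quote_refs is hypothetical, and the sample line is invented for illustration rather than copied from any of the .bat files. It wraps each unquoted %LIB%\..., %CMD%\... or %BIN%\... reference in double quotes and leaves references that are already quoted alone.

```python
import re

# Matches an unquoted reference such as %LIB%\english.par or %BIN%\tree-tagger.
# The lookbehind skips references that are already preceded by a double quote.
_REF = re.compile(r'(?<!")(%(?:LIB|CMD|BIN)%\\[^\s"|<>]+)')


def quote_refs(line: str) -> str:
    # Hypothetical helper: put double quotes around each unquoted reference.
    return _REF.sub(r'"\1"', line)


# Invented sample line, for illustration only.
print(quote_refs(r'%BIN%\tree-tagger %LIB%\english.par'))
# prints: "%BIN%\tree-tagger" "%LIB%\english.par"
```

Running the function a second time on its own output changes nothing, so it should be safe on a file that has been partly edited by hand. Keep a backup of each .bat file before trying anything of this kind.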
3. Needed only if you intend to use
the TreeTagger from the MS-DOS command line:
Install a Perl interpreter (if you have not already installed one). You can
download Perl for free at http://www.perl.com/pub/language/info/software.html
Note: because the graphic interface removes the need for the two Perl scripts unpacked into \cmd, it enables the TreeTagger to run fully under Windows 95 (on which Perl is inoperable). Alternatively in Windows 95, you can prepare your input in one-per-line format, edit the batch files to remove the Perl parts, and run TreeTagger from the MS-DOS command line.
4. From the TreeTagger website, download the parameter files
for the languages you need, decompress them (eg. using Winzip) and move them to the
subdirectory C:\Program Files\TreeTagger\lib.
If necessary, rename the parameter files to <language>.par, eg. to use
the "small German" parameters, rename german-small-par-linux-3.1.bin to
german.par
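The renaming rule can be illustrated with a short Python sketch. It rests on an assumption that is not stated on the TreeTagger website: that the language name is the part of the downloaded file name before the first hyphen or dot. The function name par_name is hypothetical, and the sketch only computes the target name; it does not rename anything on disk.

```python
import re
from pathlib import PureWindowsPath


def par_name(downloaded: str) -> str:
    # Hypothetical helper: derive "<language>.par" from a parameter file name.
    # Assumption: the language is the text before the first hyphen or dot.
    file_name = PureWindowsPath(downloaded).name
    language = re.split(r"[-.]", file_name, maxsplit=1)[0]
    return language + ".par"


print(par_name("german-small-par-linux-3.1.bin"))
# prints: german.par
```

A file whose language name itself contains a hyphen would defeat this assumption, so the result should be checked by eye before renaming.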
5. Needed only if you intend to use
the TreeTagger from the MS-DOS command line:
Add the TreeTagger \bin directory to the MS-DOS path, as follows.
• In older versions of Windows, go to the MS-DOS command line and edit the
following line into the file autoexec.bat, after any other set PATH= lines:
set PATH=C:\Program Files\TreeTagger\bin;%PATH%
This change may not take effect until you reboot the machine.
• In newer versions of Windows, change the system PATH variable; for example,
in Windows XP, right-click My Computer; (left-)click Properties; click
Advanced; click Environment Variables; highlight the System variable
"Path", click Edit, and add the text
C:\Program Files\TreeTagger\bin;
to the beginning of the existing value.
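The effect of this step on the value of the path can be shown with a minimal Python sketch. The function name prepend_to_path is hypothetical, and the sketch only computes the new value from the old one; it does not change autoexec.bat or the system settings. It assumes entries are separated by semicolons and compared without regard to case or a trailing backslash.

```python
def prepend_to_path(path_value: str, directory: str) -> str:
    # Hypothetical helper: return path_value with directory placed first,
    # unless the directory is already listed (case-insensitive comparison).
    entries = [entry for entry in path_value.split(";") if entry]
    wanted = directory.rstrip("\\").lower()
    if any(entry.rstrip("\\").lower() == wanted for entry in entries):
        return path_value
    return ";".join([directory] + entries)


print(prepend_to_path(r"C:\Windows;C:\Windows\System32",
                      r"C:\Program Files\TreeTagger\bin"))
# prints: C:\Program Files\TreeTagger\bin;C:\Windows;C:\Windows\System32
```

The check for an existing entry is there so that repeating the step does not add the directory twice.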
6. To add the graphic interface, simply place the two interface programs (wintreetagger.exe and wintraintreetagger.exe) into C:\Program Files\TreeTagger\bin, alongside the two .exe files from the TreeTagger distribution (tree-tagger.exe and train-tree-tagger.exe). Do NOT make copies of the interface programs in other directories, as such copies will be unable to find the TreeTagger components. When you want to use either of the interface programs from another directory, make a shortcut from there to it.
Now you can test the TreeTagger. Find or create a plain text
file containing a small piece of running text in English — not more than a
couple of hundred words for a start. Say the file is called sample.txt.
If you installed the TreeTagger for use from the MS-DOS command line, open the
MS-DOS command-line processor, go to the directory containing sample.txt, and
type
tag-english sample.txt
If you installed the graphic interface, stay in Windows, go to the directory
containing sample.txt, make a shortcut to
C:\Program Files\TreeTagger\bin\wintreetagger.exe
and run the shortcut.
The TreeTagger requires text for tagging to be in a one-per-line format, ie. each token must be on a separate line. Tokens may have internal spaces, ie. be multi-word units. Each punctuation mark is treated as a token and should be on a separate line too. Clitics should also be on separate lines, if they were treated as separate tokens in the language's training data. The character-set should be the same as that of the training data — Latin-1 for most languages, Unicode UTF8 for a couple of others. Text in the one-per-line format may optionally contain manual tagging for some tokens, in the form of the probability and/or the lemma.
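To make the one-per-line format concrete, here is a minimal Python sketch that turns a piece of running text into one token per line. It is an illustration only and is not the tokenization performed by this interface or by the Perl scripts in the distribution: the function name one_per_line and the regular expression are assumptions, clitics are left attached to their words, and multi-word tokens are not recognised.

```python
import re

# A token is either a run of word characters (allowing internal apostrophes
# or hyphens) or a single non-space, non-word character such as punctuation.
_TOKEN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def one_per_line(text: str) -> str:
    # Hypothetical helper: return the tokens of text, one on each line.
    return "\n".join(_TOKEN.findall(text))


print(one_per_line("Hello, world. It's fine!"))
```

For the sample sentence this yields seven lines: Hello, a comma, world, a full stop, It's, fine, and an exclamation mark. Whether "It's" ought to be split further depends on how clitics were treated in the training data for the language.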
This interface provides several options for tokenization, that is, for taking an input file of normal running text and delivering it to the TreeTagger in the one-per-line format:
The built-in tokenization may not work for texts in Unicode.
The built-in tokenization performs the following operations:
There are several further optional actions in the built-in tokenization:
The following features are under consideration for addition to the interface in the future:
This interface is offered as a facility for corpus analysis on Windows. By using it, you are deemed to accept that the author bears no responsibility for any adverse consequences. Needless to say, he hopes that there will be no such consequences. He will be pleased to receive comments, but cannot promise to act upon them.