The Tobar na Gaedhilge system for lemmatizing Gaelic texts
This system is still in active development and many aspects of it are expected to evolve further. This description of it will likewise evolve, hopefully towards eventual completeness.
The markup of lemmatization
In a lemmatized Gaelic text, each token will be followed by a code, placed between guillemets.
This code is intended to distinguish the senses of the form; the actual code is arbitrary and has no intrinsic meaning, but serves only as a distinguisher.
For example:
bog«2» — adjective ("soft"), citation form and cases having the same form
bog«3.2p» — verb ("move"), past, most persons
bog«3.2» — verb ("move"), imperative 2s
bogadh«1.1.2» — verb ("move"), verbal noun
bogaidh«3.2» — verb ("move"), imperative 2p
bogfadh«3.2» — verb ("move"), conditional, most persons
bogtha«3.2» — verb ("move"), verbal adjective
boig«2» — adjective ("soft"), genitive singular masculine and vocative singular masculine and dative singular feminine
boige«1.2» — adjective ("soft"), comparative and genitive singular feminine
Comments:
• we do not separately encode forms differing only in upper/lower case, and which have the same meaning, eg. bog«2» also covers Bog«2», BOG«2», etc.
• we do not separately encode mutated forms; eg. bog«2» also covers bhog«2» and mbog«2»; athair«1» covers t-athair«1», n-athair«1», h-athair«1», etc.
• at this time, we sometimes encode separately forms which are homographic but differ in morphology; eg. we distinguish above between the imperative 2s and past of a verb,
but not between the homographic persons of the past tense of a verb, and not between the homographic cases of a noun or those of an adjective
• other inflected forms can always be added to lists such as the above — we accumulate them as they are encountered
• although the codes are arbitrary, some patterning is already discernible in the above list, but this is simply an aid to human memory in working with them;
there is also some similarity to the numbering of FGB entries — but it is not possible to apply this principle in every case
• the fineness of the morphological distinctions is still very variable — at this stage, cases are not separately coded whereas tenses/persons sometimes are
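The markup described above can be read mechanically. What follows is a minimal sketch in Python, not part of the project's own programs: it assumes that a lemmatized token is a run of non-space characters immediately followed by a code between guillemets, and the names TOKEN_RE, parse_lemmatized and same_sense are invented for illustration. It handles the upper/lower case comment only; recognizing mutated forms would need knowledge of the language and is not attempted.

```python
import re

# Assumed shape of a lemmatized token: form, then «code» with no space between.
TOKEN_RE = re.compile(r"(?P<form>[^\s«»]+)«(?P<code>[^«»]*)»")

def parse_lemmatized(text):
    """Return a (form, code) pair for every form«code» unit found in text."""
    return [(m.group("form"), m.group("code")) for m in TOKEN_RE.finditer(text)]

def same_sense(a, b):
    """Compare two (form, code) pairs, ignoring upper/lower case in the form."""
    return a[0].lower() == b[0].lower() and a[1] == b[1]

pairs = parse_lemmatized("Bog«2» boige«1.2» bogadh«1.1.2»")
# pairs == [("Bog", "2"), ("boige", "1.2"), ("bogadh", "1.1.2")]
```

The code is kept as an opaque string throughout, in keeping with the statement that it has no intrinsic meaning.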
**** The fineness of sense distinction between lemmas:
In the semi-automatically lemmatized texts, sense distinctions will (as yet) rarely extend below the level of the main entries in FGB,
eg. bog«1», bog«2» and bog«3» will be separated, as they are in main entries in FGB, but bog«2.1» or bog«2.2» will not be separated from bog«2».
In the manually-lemmatized texts, however, the fineness will often extend to within-entry FGB divisions, or to morphological subdivisions.
All captured levels of distinction will be preserved in the lemmatized index: for example, if you seek garbh«1.3» or tiomáin«1.2p»,
you may find them as such in the manually-lemmatized texts, but occurrences in other texts will be merged with other subsenses under garbh«1» or tiomáin«1».****
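The merging of fine and coarse codes in the index can be pictured as follows. This is only a sketch under assumptions: it takes the part of a code before the first full stop to be the main-entry level (as the examples garbh«1.3» and tiomáin«1.2p» suggest), and it represents the index as a plain Python dictionary from (form, code) to a list of occurrence labels. The function names and the occurrence labels are hypothetical.

```python
def main_entry(code):
    """Reduce a sense code to its leading component, e.g. '1.3' -> '1', '1.2p' -> '1'."""
    return code.split(".", 1)[0]

def lookup(index, form, code):
    """Return (exact, merged): occurrences coded exactly as requested,
    and occurrences coded only at the main-entry level."""
    exact = list(index.get((form, code), []))
    top = main_entry(code)
    merged = list(index.get((form, top), [])) if top != code else []
    return exact, merged

# Hypothetical index content, for illustration only.
index = {
    ("garbh", "1.3"): ["manual-text:12"],
    ("garbh", "1"): ["other-text:40", "other-text:77"],
}
exact, merged = lookup(index, "garbh", "1.3")
# exact == ["manual-text:12"]; merged == ["other-text:40", "other-text:77"]
```

Since the codes are in principle arbitrary, a real index would more safely take the relation between a fine code and its main entry from the database than from the shape of the code.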
Program Lemmatize
This program assists in the interactive manual lemmatization of Gaelic texts which are prepared in accordance with the conventions of the Tobar na Gaedhilge project.
It obtains the potential senses of a form from an orthographic database, eg. for each token of the form bog, at least the three choices listed above will be offered.
Further, the database connects such senses as, for example, bog«2», boig«2» and boige«1.2», as pertaining to the same group;
and likewise bog«3.2», bog«4.2», bogadh«1.1.2», bogaidh«3.2», bogfadh«3.2» and bogtha«3.2».
To each such group of forms are assigned properties, such as citation-form (lemma) and part-of-speech.
That is, the sense code determines the lemma (and much else). We will freely refer to the assignment of sense codes as lemmatization,
although strictly speaking, it only enables lemmatization by supporting — in conjunction with the database — the later construction
of a lemmatized index to a text or collection of texts, like Tobar na Gaedhilge. This is its principal application.
The structure and content of this database will be presented later. For now, we describe the operation of the program which uses it to assign sense codes.
Program Lemmatize has (at this moment) two separate modes of operation.
The first and principal mode supports the exploration of an individual text, and the assignment of sense codes to its forms.
The second mode supports changing the sense code attached to a particular sense of a particular form, globally across a range of texts.
Mode: exploration and sense code assignment in a single text
The program begins by asking which text is to be lemmatized.
The text should be in a plain-text file, prepared in accordance with the format of the Tobar na Gaedhilge project.
Some tokens of the text may already be lemmatized; they will not be silently overwritten.
The beginning of the file is displayed, with the first token, highlighted, presented for lemmatization.
For readability, the text is shown with any existing sense codes hidden, but a status line, directly below the text,
confirms the current token and its presently assigned sense code, if any.
The program may simply be used to navigate through the text, without lemmatizing or making any other changes.
In navigating, there is a choice between moving back to the start of the text (button: "Tús an téacs")
or moving to the next token (button: "Athshampla").
The meaning of "next token" is determined by the checkboxes, choosing between:
• the next token in sequence (of whatever form) — no checkbox clicked
• the next token of the current form, or of any other nominated form or form-pattern
• the next token already lemmatized to a nominated lemma or lemma-pattern
• the next unlemmatized token (of whatever form)
• the next token marked up as problematic
Note: more than one checkbox may be clicked, eg. you may search for the next token of a specified form, which has a specified lemma attached.
Note: the form to be sought should be entered in lower-case; it will match either upper- or lower-case in the text.
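The way several checkboxes combine can be sketched as a search in which every active criterion must hold. The Python below is an illustration only, not the program's own code: the Token class, the function next_token and its parameter names are invented, patterns are assumed to be regular expressions (the program's actual pattern syntax is not described here), and the lemma is held as a simple string label.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    form: str
    lemma: Optional[str] = None   # None means the token is unlemmatized
    problematic: bool = False

def next_token(tokens, current, form_pattern=None, lemma_pattern=None,
               unlemmatized=False, problematic=False):
    """Return the index of the first token after `current` that satisfies
    every active criterion, or None if there is no such token."""
    for i in range(current + 1, len(tokens)):
        t = tokens[i]
        # The form to be sought is given in lower case and matches either case.
        if form_pattern is not None and not re.fullmatch(form_pattern, t.form.lower()):
            continue
        if lemma_pattern is not None and (
                t.lemma is None or not re.fullmatch(lemma_pattern, t.lemma)):
            continue
        if unlemmatized and t.lemma is not None:
            continue
        if problematic and not t.problematic:
            continue
        return i
    return None

tokens = [Token("Tá"), Token("Bog", "bog«2»"), Token("bog"), Token("bog", "bog«3.2»")]
```

With no criterion set, next_token(tokens, 0) simply returns the next position in sequence; adding form_pattern="bog" and unlemmatized=True moves on to the first unlemmatized token of that form.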
When you arrive at a token which you wish to lemmatize, click the button "Toisigh ar Leimeachán".
You can always switch off further lemmatization and return to simple navigation using the button "Scoir den Leimachán".
The options for lemmatizing the current token are:
• select a lemma from a list of possibilities, derived from the database
• "Cruthuigh ceangal úr don tsampla": create a new lemma *
• "Fág ceangal an tsampla mar atá": make no change
• "Dícheangail": remove any existing lemma, leaving the token unlemmatized
Having decided which action to apply, you are invited to confirm it.
* If you choose to add a new lemma, you may leave Program Lemmatize running
while you add the new lemma to the database plain-text files (see later) using any editor.
When you have finished editing and saving the files, resume Program Lemmatize,
and the new lemma should now appear among the available choices.
Below the status line, you are reminded of the current navigation mode,
as well as the lemma you have chosen to apply to the current token.
Confirmation options are:
• "Ceangail sampla agus faigh athshampla": apply the selected action to the token, and move to the next token (according to the navigation mode in effect)
• "Ná ceangail ach faigh athshampla": make no change, but move to the next token (according to the navigation mode in effect)
• "Fan ar an tsampla seo": make no change and remain on this token
• "Ceangail gach sampla": apply the selected action to ALL tokens of the current form
When you are ready to leave the program, you may save the modified text.
The button "Cuir an téacs i dtaisce" is to be found near the bottom left of the program window.
This button can also be used to save the text at any time during the lemmatization process.
The button "Fág an téacs" will leave the program WITHOUT SAVING THE TEXT AT THE SAME TIME.
Use this button if you have just saved the text, or if you wish to abandon any changes made to
the text since the last time you saved it (or since the program run started, if you did not save it).
If you wish to remove any lemmas added to the database during the program run, you may use a
text editor to remove the newly-added lines from the database plain-text files.
When lemmatizing, it is possible to use a "stop-list" of common word-forms.
In that case, Program Lemmatize will skip over such forms, and allow the user to concentrate
on the less common, and therefore more significant, forms.
Mode: Global change of sense code across texts
Programmed but still to be written up here.
The database
The database consists of several plain-text files. The structure of these files, and even their number, is not finalized.
They are not fully "normalized" in the database sense, as it is easier to update them manually like this.
When their content is more complete, it is expected they will be put fully in normal form.
The set of plain-text files at present consists of:
• the forms file (not used at present)
• the formsenses file
• the formsenseexamples file
• the decompositions file
• the decompositionexamples file
The content of the database is specialized towards handling the spelling found in the texts of the Ulaidh collection,
including phonetically-reasonable variants.
It is far from complete, and will be incrementally developed, as more texts are lemmatized.
Similar databases could be constructed for the orthographies of other text collections.
Program Lexicon (see below) provides facilities for manipulating this database,
including conversion of the set of plain-text files to a SQLite database.
The formsenses file
Data about these forms is maintained manually in a plain-text file, called formsenses.txt.
The file will contain a record for every distinct sense of every distinct word-type, ie. every distinct type/sense combination.
There may also be separate records for different morphological categories of a type/sense, even when homonymic. This is still being refined.
A record contains a number of fields as follows. (The fields are re-ordered here, for ease of explanation; in the file they are ordered for ease of maintenance.)
FORM 1– 23 the form as found in text, but lowercased and (except when the mutation is permanent) demutated
SENSE 24– 38 the sense, an arbitrary code; serves a similar function to FGB's superscripts and numbered divisions
KEY_RAW 126–159 form+sense — this field uniquely determines the following fields until otherwise stated
SORT 464–467 controls the order in which the senses for a form should be displayed by programs, eg. during lemmatization of that form in a text
ORTHOG_1 160–177 provisional orthography for this form+sense, unifying variants which imply no phonetic difference,
e.g. gáiridh may be unified with gáirí; but a' ( < an) may not be unified with an
USAGE 178–226 may contain example of use, to guide human application
STATUS 227–255 may contain comment such as "unused" or "no example in texts so far lemmatized"
SPELLING-SPECIFIC 256–324 may contain comment such as database compounds which contain this form+sense, or other hints
NOTES 513–end
FGB 39–125 FGB information for this form+sense
KEY_DESPELL 325–358 unifies spellings which vary phonetically with the context — this field uniquely determines the following fields until otherwise stated
e.g. a' ( < an) and 'n ( < an) are unified with an; rachad with rachadh;
ORTHOG_2 359–380 provisional context-independent orthography for group of forms sharing same KEY_DESPELL
INFLECTION-SPECIFIC 381–413 inflectional category for this KEY_DESPELL, eg. (gs) for genitive singular; (past aut abs) for absolute past autonomous
KEY_DEINFLECT 414–440 unifies inflections — this field uniquely determines the following fields until otherwise stated
KEY_4 441–463 spare field for unification — this field uniquely determines the following fields until otherwise stated
MEANING 468–512 part of speech and distinctive gloss
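Since each field occupies a fixed range of columns, a record can be split by position. The following Python sketch copies the column ranges from the list above; it assumes that positions count characters (not bytes) in a line which has already been decoded, and that a line shorter than 513 characters simply has empty later fields. The names FIELDS and parse_record are invented for illustration.

```python
# 1-based, inclusive column ranges as listed above; None means "to end of line".
FIELDS = {
    "FORM": (1, 23),
    "SENSE": (24, 38),
    "FGB": (39, 125),
    "KEY_RAW": (126, 159),
    "ORTHOG_1": (160, 177),
    "USAGE": (178, 226),
    "STATUS": (227, 255),
    "SPELLING-SPECIFIC": (256, 324),
    "KEY_DESPELL": (325, 358),
    "ORTHOG_2": (359, 380),
    "INFLECTION-SPECIFIC": (381, 413),
    "KEY_DEINFLECT": (414, 440),
    "KEY_4": (441, 463),
    "SORT": (464, 467),
    "MEANING": (468, 512),
    "NOTES": (513, None),
}

def parse_record(line):
    """Split one fixed-width line into a dict of field name -> stripped value."""
    return {name: line[start - 1:end].strip()
            for name, (start, end) in FIELDS.items()}

# A partial line built by padding each field to its width (illustration only).
line = "boig".ljust(23) + "«2»".ljust(15) + "".ljust(87) + "boig_2".ljust(34)
record = parse_record(line)
# record["FORM"] == "boig"; record["SENSE"] == "«2»"; record["KEY_RAW"] == "boig_2"
```

Keeping the ranges in one table means that a change to the file layout, which is not yet finalized, needs to be made in only one place.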
Several cautions:
1. It will be seen that, moving through the columns from left to right, there is a gradual unification from forms to lemmas.
At the present stage, many of the intermediate columns are not in actual use, and in order to expedite development of the main
function of the database, the burden of unification has often been largely deferred until ORTHOG_2, where it happens all at once.
Later, when programming requires the use of intermediate columns, the unification will have to be moved back appropriately, in accordance with the stated
purposes of the columns. In the meantime, a single question mark placed on ORTHOG_2 means only that consideration of this movement has still to take place.
****It is still to be properly considered whether níghean and nighean; iomchur and iomchar; comhairle and cómhairle; scian and sgian; ceo and ceó should be unified at ORTHOG_1 or ORTHOG_2.
****KEY_DEINFLECT could be used to preserve phonetically-different forms with the same morphological function, eg. fuilstin/fuilingt, faduigh/fadóigh which will be unified in KEY_4.
ORTHOG_1: common representation of spellings which do not imply a phonetic difference; gáiridhe, gáirí, gáiridh
ORTHOG_2: common representation of spellings which are conditioned by context; an, a', 'n
KEY_DEINFLECT (rename KEY_3): common representation of alternative forms which are functionally identical; fuilingt, fuilstin ??
KEY_4 (rename KEY_DEINFLECT): common representation of forms belonging to the same paradigm (ie. lemma); fuilingt, fulaing ??
2. The values placed in intermediate and later columns, eg. ORTHOG_1, ORTHOG_2, KEY_4, are provisional, and are liable to change when the data has been assembled and can be
examined comprehensively. The groupings which are evolving are not expected to change greatly, but the values representing them will certainly do so.
3. "No example" in the STATUS column means that the record was inserted to serve as a citation form, and that no actual example had been found by that time.
Of course, an example may appear later, but the annotation has not been removed.
When the database is substantially complete, the "no example" status should be calculated, to replace these manually entered values.
Some example records follow:
FORM boig bogaidh níghean iomchur gáiridh a'
SENSE «2» «3.2» «1» «1.1» «1.2» «1»
KEY_RAW boig_2 bogaidh_3.2 níghean_1 iomchur_1.1 gáiridh_1.2 a'_1
SORT 001
ORTHOG_1 boig bogaidh níghean iomchur *gáirí a'
USAGE
STATUS
SPELLING-SPECIFIC component of 'n«3»|s«3»|'s«4»+a'«1»
NOTES officially an except in composite s«3»|'s«4»+a'«1» which is sa
FGB boig : bog_2 = bogaigí : bog_3.2; sibh_0 = nighean_0 = iníon : iníon_0 = iompar : iompair_1 =gáirí_1.1 : gáire_1.2 a or an : an_1
KEY_DESPELL boig_2 *bogaidh_3.2 *nighean_1 *iomchar_1.1 gáirí_1.2 *an_1
ORTHOG_2 boig bogaidh nighean iomchar gáirí an
INFLECTION-SPECIFIC (gsm) (imperative 2p) (vn) (p)
KEY_DEINFLECT bog_2 bog_4.2; sibh_1 nighean_1 iomchair_3 gáire_1.2 an_1
KEY_4 bog_2 bog_4.2; sibh_1 nighean_1 iomchair_3 gáire_1.2 an_1
MEANING J, soft, mild, quiet V, move, rock; you Nc, daughter V, carry Nc, laugh T,"an"
Since certain fields depend only on KEY_DESPELL or KEY_DEINFLECT or KEY_4, this file could be normalised into several different files,
but at this stage manual maintenance is simpler with a single file, despite the duplication of the data in those dependent fields.
The part-of-speech (POS) tags, used in the MEANING field of the formsenses file, are:
C: conjunction
E: emphatic suffix
F: pronoun
I: interjection
J: adjective
M: numeral
Nc: common noun
Np: proper noun
S: preposition
T: article
V: verb
Z: other
Further refinements are envisaged, including subdivision of the Z category.
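For programs which read the MEANING field, the tags above can be held as a simple lookup table. The Python sketch below assumes that the tag is whatever precedes the first comma in the field, as in the example values 'J, soft, mild, quiet' and 'T,"an"'; the names POS_TAGS and split_meaning are invented for illustration, and composite values such as 'V, move, rock; you' are not analysed further.

```python
POS_TAGS = {
    "C": "conjunction",
    "E": "emphatic suffix",
    "F": "pronoun",
    "I": "interjection",
    "J": "adjective",
    "M": "numeral",
    "Nc": "common noun",
    "Np": "proper noun",
    "S": "preposition",
    "T": "article",
    "V": "verb",
    "Z": "other",
}

def split_meaning(meaning):
    """Split a MEANING value into (tag, tag name, gloss).
    An unrecognized tag is reported as 'unknown' rather than rejected."""
    tag, _, gloss = meaning.partition(",")
    tag = tag.strip()
    return tag, POS_TAGS.get(tag, "unknown"), gloss.strip()

result = split_meaning("J, soft, mild, quiet")
# result == ("J", "adjective", "soft, mild, quiet")
```

Reporting an unknown tag instead of failing leaves room for the further refinements which are envisaged.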
The decompositions file
In our basic text markup, we frequently use the plus character to divide forms, where this is appropriate, eg. do+'n. Each of the constituent forms has its own lemma.
****
But eg leisean — should we drop decompositions? and later, drop the + sign ?
Pro: It would make the forms index more surfacey.
Contra: Form would not be segmented.
But there are forms which cannot be cleanly divided, yet which we regard as containing two or more lemmas (eg. in "fhad is toil le Dia", where is represents the lemmas agus and is).
A more general mechanism is required, and is provided by holding composites, whether or not they are cleanly divisible into forms, as records in a second plain-text file, the decompositions file.
The fields of the decompositions file are:
FORM 1– 18 the (composite) form
DECOMPNUMBER 19– 36 a number to distinguish homographic composite forms; form+decompnumber make up the unique key which determines the content of the remaining fields
FGBOrthog 37– 51 FGB orthography of the components
FGBLemma 52– 73 FGB lemmas of the components
DECOMPOSITION 74– 99 decomposition into form«sense» units of the formsenses file
CONDITIONS 100–139
Hint (Despelled) 140–189 decomposition into KEY_4 units of the formsenses file
SORT 190–193 controls the order in which the decompositions of a form should be displayed by programs
HINT 194–243 eg. gloss
NOTES 244–
Some example records follow:
FORM is
DECOMPNUMBER *1
FGBOrthog agus is
FGBLemma agus is_1
DECOMPOSITION «26»+is«1»
CONDITIONS
Hint (Despelled) agus_1 + is_1
SORT 501
HINT
NOTES
is *1 agus is agus is_1 «26»+is«1» agus_1 + is_1 501
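The DECOMPOSITION field can be broken into its form«sense» units mechanically. The Python below is a sketch only: it assumes that units are separated by the plus character and that each unit ends with a code between guillemets. It accepts a unit with no written form because the example record above shows one («26»), without taking any view on how such a unit should be interpreted. The names UNIT_RE and parse_decomposition are invented for illustration.

```python
import re

# A unit is an optional written form followed by a code between guillemets.
UNIT_RE = re.compile(r"(?P<form>[^«»]*)«(?P<code>[^«»]+)»")

def parse_decomposition(value):
    """Split a DECOMPOSITION value into a list of (form, code) pairs."""
    units = []
    for part in value.split("+"):
        m = UNIT_RE.fullmatch(part.strip())
        if m is None:
            raise ValueError(f"not a form«sense» unit: {part!r}")
        units.append((m.group("form"), m.group("code")))
    return units

units = parse_decomposition("«26»+is«1»")
# units == [("", "26"), ("is", "1")]
```

Each resulting pair can then be looked up in the formsenses file in the same way as a token lemmatized directly in a text.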
****
eg.
One advantage over plus is that, should we decide to change globally the division of a composite form (eg. ), this requires only amendment of the decomposition record,
without going back through all the texts.
Another advantage is that using the decompositions file to divide the form (for example) arb restricts the choices for the ar to the relative or interrogative particle,
whereas the markup ar+b leaves a wider range of possibilities for ar at a later time, since the contextualizing effect of b is lost.
Dá: 2 (3) unitary, many composite
There are also two files of examples, formsenseexamples.txt and decompositionexamples.txt, which are as yet poorly populated and little used.
Program Lexicon
This program is used to maintain the database which Program Lemmatize uses in the manual lemmatization of a Gaelic text.
This program performs several operations:
• convert the above set of plain-text files into a SQLite database (named guladh.db)
• display the content of the database
• perform queries on the database
• export some special plain-text files (formsense+POS, formsense+FGBlemma), for use in other programs
Whenever Program Lemmatize adds a new lemma to the plain-text files, the necessary conversion into an SQLite database
is automatically performed by an invisible call to Program Lexicon. If you make manual changes to the plain-text files, you
must run Program Lexicon and its convert option before the changes become effective.
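To give an idea of what such a conversion involves, here is a minimal sketch using Python's standard sqlite3 module. It is not the code of Program Lexicon, and the table name, the choice of columns and the function names are assumptions made for illustration; the actual schema of guladh.db is not described here. The two rows are taken from the example records of the formsenses file, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

def load_formsenses(conn, records):
    """Create a (hypothetical) formsenses table and load
    (form, sense, key_4, meaning) rows, replacing rows with the same key."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS formsenses (
               form TEXT NOT NULL,
               sense TEXT NOT NULL,
               key_4 TEXT,
               meaning TEXT,
               PRIMARY KEY (form, sense))"""
    )
    conn.executemany(
        "INSERT OR REPLACE INTO formsenses VALUES (?, ?, ?, ?)", records
    )
    conn.commit()

def senses_for(conn, form):
    """Return (sense, meaning) rows for a form, which is lowercased first."""
    cur = conn.execute(
        "SELECT sense, meaning FROM formsenses WHERE form = ? ORDER BY sense",
        (form.lower(),),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
load_formsenses(conn, [
    ("boig", "2", "bog_2", "J, soft, mild, quiet"),
    ("bogaidh", "3.2", "bog_4.2; sibh_1", "V, move, rock; you"),
])
rows = senses_for(conn, "Boig")
# rows == [("2", "J, soft, mild, quiet")]
```

Using form and sense together as the primary key mirrors the role of KEY_RAW, and means that re-running the conversion after a manual edit replaces a record instead of duplicating it.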
The database is not used directly by Tobar na Gaedhilge. Instead, when making a lemmatized index (Program Setup) or during
retrieval (Program Tobar), special files (SenseToPOSList.dat, SenseToFGBList.dat) are used, which are exported from Program Lexicon.
When Program Lexicon is run implicitly from Program Lemmatize, these files are not generated. When Program Lexicon is run by itself,
the option is given to generate these files, and this must be done before using them in Tobar na Gaedhilge.
Ciarán Ó Duibhín
Updated 2024/07/31