What it does
Tobar na Gaedhilge is a searchable textbase of high-quality 20th-century Gaelic texts (mostly Irish, with some Scottish), best described as "continuity Gaelic" in its several naturally-occurring varieties. The textbase contains over 5.8 million words of Gaelic text, and is freely downloadable for installation on a personal computer under MS Windows. After the software is installed, a word may be requested, and examples of its use may be viewed. The purpose of Tobar na Gaedhilge is to allow the texts to be used as a lexical and grammatical resource. To protect the rights of the authors and publishers, the texts are not made available in continuously readable form.
A summary of some of the user-friendly features of the system:
• texts are grouped into collections, each for a particular naturally-occurring variety of Gaelic
• a complete pick-list of the words in the chosen collection is presented on-screen
• words are listed with initial mutations removed, eg. fear, fhear, bhfear are all grouped under fear; while athair, h-athair, n-athair, t-athair are all grouped under athair
• words are listed with enclitics separated, eg. d'ith is listed under d' and ith; agamsa is listed under agam and sa
• apostrophes are distinguished from quotation marks and are retained in words, eg. 'ach is not conflated with ach
• words split at end-of-line in the original text are rejoined in the list, where appropriate
• collocations of words may be found, either in immediate sequence, or within a sentence-like unit
• an analytic approach is taken to compound words, eg. sean-amhrán is treated as a sequence of sean, -, and amhrán
• accents may optionally be disregarded, eg. comhartha may be made to retrieve both comhartha and cómhartha
• an asterisk may be used as a wild-card to match all or part of a word, eg. beir* or *stin
These features were designed into the program from the beginning, and the texts were prepared with the markup necessary to support them. This product is completely free of adware, spyware or other harmful inclusions.
In addition, the textbase contains translations of some of the Gaelic texts into or from English, French, German and Russian. These will be collectively referred to as other languages. Retrieved segments of Gaelic text may be displayed in other languages, if the translations are stored; and further, the words occurring in the stored texts in other languages may be used to search the textbase.
• What the results may look like
We may begin by finding sentences containing a selected Gaelic word, and we will do one example each from the Munster, Connacht and Scottish texts. Thereafter, we will draw our examples from the Ulster texts, which form by far the largest part of the material stored. (We welcome contributions of Munster, Connacht or Scottish texts.)
Note that some of these screenshots may be from earlier versions of the system.
Figure 1: We looked for examples in the Munster texts of the word cábóg (a country person). We found two examples in Pádraig Ua Maoileoin, Na hAird Ó Thuaidh, and we show the second example here. Page and line reference is given to the published book.
In passing, a small thing to notice in this example is the hyphen in "sráid-eanna". This shows that the word was hyphenated (at the end of a line) in the original book. But the hyphen does not interfere with indexing the sentence, as use of the Innéacsáilte display mode will confirm.
The navigation panel (at upper right) allows us to move among the retrieved examples. The panel at the lower right allows the form in which the sentence is displayed to be modified. All the options are described under Figure 6 below.
Figure 2: We looked for examples in the Connacht texts of the sequence of words lúb gaoil (blood relationship), and we found two examples, both in Séamus Mag Uidhir, Fánaidheacht i gConndae Mhuigheo. We show the first sentence here.
Figure 3: We looked for examples in the Scottish texts of words beginning with càr. With a request like this for words matching a general pattern, clearly more than one word may match — we refer to this situation as matching a disjunction of words. The sentences for each matching word are presented (over all the relevant books) before presenting the sentences for the next matching word. Here, we show a sentence containing the matching word càraich (fix).
Besides viewing complete sentences as above (Display option: Abairteacha), two other ways of viewing results are provided, which are more compact for large quantities of text. Output from these methods is shown in the next two figures.
Figure 4: Uses of the word saoghal (life) shown as its frequency (Display option: Minicidheacht) in the various Ulster texts.
Scrolling may be necessary to reveal the information for all the texts. If the request matched a disjunction of words, a menu option AthFhocal (next word) will proceed to the next matching word. A menu option Réidh (finished) leaves the displayed results and returns to the search screen.
Figure 5: A keyword-in-context index (Display option: KWIC) of the word-form athair (father) in Séamus 'ac Grianna, Thiar i dTír Chonaill. The first batch of 43 occurrences is shown.
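For readers unfamiliar with the format, the idea of a keyword-in-context listing can be sketched in a few lines of Python. This is only an illustration of the general technique, not the program's own code: the function name `kwic`, the `width` parameter and the token list are all invented for the example.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented token list (not taken from the textbase), already in index form.
tokens = ["bhí", "athair", "mór", "ag", "an", "fear", "agus",
          "bhí", "athair", "eile", "ag", "an", "bean"]

for left, key, right in kwic(tokens, "athair", width=2):
    # Right-align the left context so that the keyword lines up in a column.
    print(f"{left:>12}  {key}  {right}")
```

Each occurrence yields one line, with the keyword aligned in a central column, which is what makes the display compact for large numbers of examples.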
The navigation panel at the upper right allows several options. First of all, we may move up and down through the concordance lines, which are in screenfuls of a size which depends on the size of the window (43 lines in this illustration). We may move down a screenful (Síos), up a screenful (Suas), to start of text (Suas go bárr), to end of text (Síos go bun). However, these options will not take us into another text, nor (if the request matched several words in a disjunction) will they take us into another matching word; they move only within the current text, for the current matching word.
Second, we may move, forwards only, through the entire collection: AthShlámán (next batch of examples from the current book), AthLeabhar (next book for the current matching word), AthFhocal (next matching word, if the request matched several in disjunction), Réidh (finished). If we wish to examine all the examples from the beginning again, we need only choose Réidh and then click OK without altering our previous choices. This second set of options is provided with keyboard shortcuts, which may prove convenient for repeated use.
Cóipeáil (copy) copies the current display to a text file. By default, this file is called samplaí.txt and is placed in the My Documents folder, and the copied material is appended to it. Comhad Cóipeála (copy file) allows the name and location of the file to be changed, and also the mode from append to overwrite (but it will revert to appending after overwriting once).
Réidh (finished) leaves the displayed results and returns to the search screen.
Figure 6: Returning now to display by sentences, we give an example of the word oidhreógach (ice) from the Ulster texts. The example shown is from Seosamh 'ac Grianna, Pádraic Ó Conaire agus Aistí Eile.
The panel at the lower right allows the text of the sentence to be shown in a choice of ways: plain and uncorrected (Foillsighthe); plain but corrected (Lom); including mark-up (Marcáilte); or as a list of the words by which it is indexed (Innéacsáilte). The Innéacsáilte option allows you to see which index terms will fetch this sentence, and so to examine how the text has been tokenized and indexed. The panel also controls the display of the sentence in other languages, when this is possible (see later); any language with its name shown in italics is unavailable for this sentence. The choice of display modes applies only to the first language, i.e. the language of the index used; for other languages, the text is shown plain.
To explain what is meant by "correction": this is limited to obvious errors. An isolated spelling variant which is at odds with consistent usage in the rest of the same book may also be corrected; but beyond this, we make no attempt at normalization of forms which are not clearly in error. Corrections to Gaelic text may also include restoring the wording of the manuscript where known. And correction of a text in a language other than Gaelic may be used to bring that text closer to the edition used by the Gaelic translator. None of these correction processes can be guaranteed to have been applied exhaustively. The corrections are applied to the indexes, and to the displayed sentences under the Lom option. The uncorrected book text will always be displayed under the Foillsighthe option.
As an example, the misprint cómhhartha occurs in Ben-Hur on page 337, and should clearly be corrected and indexed as cómhartha. So if we search for cómhartha, we will find the relevant sentence, and if we view it under the default Lom option, we will see the token as cómhartha. But if viewed under the Foillsighthe option, we will see it, uncorrected, as cómhhartha. Under the Marcáilte option, we will see the complete markup as cómh[h]artha. And under the Innéacsáilte option, we will see it confirmed that the token is indexed as cómhartha.
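The relationship between the three views of this token can be sketched in Python. The real markup scheme is not described in this section, so the sketch assumes only what the single example suggests: that square brackets enclose characters which were printed in the book but are removed by correction. The function names are invented for the illustration.

```python
import re

# Assumption: "[...]" marks printed characters that the correction deletes.
BRACKETED = re.compile(r"\[([^\]]*)\]")

def as_published(marked):
    """Foillsighthe-style view: keep the bracketed characters, drop the brackets."""
    return BRACKETED.sub(r"\1", marked)

def as_corrected(marked):
    """Lom-style view: drop the bracketed characters altogether."""
    return BRACKETED.sub("", marked)

marked = "cómh[h]artha"
print(as_published(marked))   # the misprint, as in the book
print(as_corrected(marked))   # the corrected form, which is also the index term
```

Other kinds of correction (for example, inserting a missing letter) would need further markup conventions, which this sketch does not attempt to guess.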
The navigation panel at the upper right allows several options.
First of all, we may move around the examples. We may move down a sentence (Síos), up a sentence (Suas), to start of text (Suas go bárr), to end of text (Síos go bun). However, these options will not take us into another text, nor (if the request matched several words in a disjunction) will they take us into another matching word; they move only within the current text, for the current matching word.
Second, we may move, forwards only, through the whole collection: AthAbairt (next sentence for the current book), AthLeabhar (next book for the current matching word), AthFhocal (next matching word, if the request matched several in disjunction), Réidh (finished). If we wish to examine all the examples from the beginning again, we need only choose Réidh and then click OK without altering our previous choices. This second set of options is provided with keyboard shortcuts, which may prove convenient for repeated use.
Cóipeáil (copy) copies the current display to a text file. By default, this file is called samplaí.txt and is placed in the My Documents folder, and the copied material is appended to it. Comhad Cóipeála (copy file) allows the name and location of the file to be changed, and also the mode from append to overwrite (but it will revert to appending after the first copy). The darker elongated panel shown above is the result of clicking Comhad Cóipeála.
Réidh (finished) leaves the displayed results and returns to the search screen.
Quirks of the navigation panel (the panel headed "Comhad cóipeála")
The navigation panel is implemented as a "pop-up", which causes it to behave differently from other panels in some minor ways. While the navigation panel is on display, there are some situations in which it may vanish, for example after an accidental mouse click by the user. In most such situations it automatically re-appears immediately. If it does not re-appear, clicking on almost any part of the Tobar window will reinstate it.
Pressing the left mouse button on the TITLE BAR of the Tobar window makes the navigation window vanish, and it will not re-appear if you release the mouse button quickly; this has its uses, as we'll see in a moment. But if you hold the mouse button down on the title bar for more than half a second or so before releasing it, the navigation window re-appears automatically on releasing the button.
If you try to MOVE the Tobar window by dragging on the title bar, the navigation panel will vanish, but will re-appear when you finish dragging and release the mouse button.
If you try to RESIZE the Tobar window while the navigation panel is displayed, you will find that you cannot. But you can make the navigation panel vanish (eg. by a quick click on the title bar, as above), and you can then resize the window. The navigation panel will re-appear when you finish resizing.
Multi-word retrievals may require the creation of workfiles. These will be placed in the temporary workspace folder, if possible. In the unlikely event that this is not possible, you will be prompted for the name of a folder to hold workfiles; you could, for example, place them on a memory stick.
In displaying results by sentence, the translation of the sentence into other languages may be shown, if it is stored. To do this, tick the names of the desired languages, visible on the sentence-level displays above, labelled Béarla (English), Fraincis (French), Gearmáinis (German) and Rúisis (Russian). If the name of the language is italicised, this means that a translation in that language is not available for the sentence to be displayed. At present, translations or originals are held only for texts in the Ulaidh collection, and only for a proportion of this: English, 2.27M tokens; French, 0.87M tokens; German, 0.74M tokens; Russian, 0.09M tokens.
Figure 7: Sentences containing the word creafadaigh (shaking); the first of two examples found in Seosamh 'ac Grianna, Seideán Bruithne/Amy Foster. English and French translations are available and are shown.
When a sentence is shown in several languages, the choice of display mode between plain (Lom); including mark-up (Marcáilte); as a list of the word-forms by which it is indexed (Innéacsáilte); or plain and uncorrected exactly as printed (Foillsighthe) applies only to the language of the index used. In other languages the sentence is shown plain.
The page and line number are shown for the displayed sentence, and this has recently been extended to apply to each language in which the sentence is displayed, rather than only to the first language. The page and line number are taken from the edition used, and they can be displayed only if the computer-readable version of the text is paginated; see the list of texts below for this information. All the Gaelic texts are paginated, but some in other languages, which have been obtained from various sources, may not be paginated.
• What Gaelic texts can be searched
Gaelic is found in several slightly different forms, and the texts are organized into collections to reflect this and to keep each collection fairly homogeneous in language. The five collections supplied are (as of 2021/05/11):
- Ulaidh (Ulster) Index: Gaedhilg. 17 authors; 73 books; 56,604 word-forms; 5,004,177 word-tokens
- Connachta Index: Gaedhilge. 8 authors; 10 books; 22,983 word-forms; 406,315 word-tokens
- Mumhain (Munster) Index: Gaolainn. 6 authors; 6 books; 13,375 word-forms; 226,404 word-tokens
- Alba (Scotland) Index: Gàidhlig. 4 authors; 5 books; 8,024 word-forms; 115,996 word-tokens
- Oirthear (Eastern) Index: Gaodhlag. 2 regions; 3 books; 8,699 word-forms; 104,843 word-tokens
At the present stage of development, the Ulaidh collection is much larger than any of the others.
A different division into collections would, of course, be possible.
The identities of the texts in all collections are listed in full below.
Searching may be restricted to any chosen subset of the texts of a collection, by deselecting temporarily individual texts or authors.
Each collection has a pre-compiled index associated with it, made from the words found in the relevant books, and in which requests for words are looked up. The statistics just given for the collections refer to these indexes — indexes of words found in the Gaelic texts. This does not mean that the words in the index are exclusively Gaelic words; rather they will reflect faithfully what the Gaelic texts contain.
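The difference between the two figures quoted for each collection, word-forms (distinct index entries) and word-tokens (running words), can be illustrated with a minimal Python sketch. The token list is invented, and the function name `index_statistics` is hypothetical; the program's own indexes are pre-compiled and are not built this way at run time.

```python
from collections import Counter

def index_statistics(tokens):
    """Count distinct word-forms and running word-tokens in a token list."""
    counts = Counter(tokens)
    return {"word_forms": len(counts), "word_tokens": sum(counts.values())}

# Invented sample, already lowercased and demutated as the index would hold it.
tokens = ["an", "fear", "agus", "an", "bean", "agus", "an", "fear"]
print(index_statistics(tokens))
```

Here eight running tokens reduce to four distinct word-forms, in the same way that the 5,004,177 word-tokens of the Ulaidh collection reduce to 56,604 word-forms.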
Much more detail will be given later about what may be found in the indexes.
Figure 8: This is the program's opening screen, and the first task is to choose the desired collection and index.
The program should show a list of the available collections, as in blue above — together with some statistics of the highlighted collection (below) and a pick list of its index (on the right). If there are no collections listed, you may be in the wrong folder, and you can browse (using Cuirtear Lorg) to a different folder. The indexes available to the highlighted collection are shown in brown. When you have marked the desired collection and chosen your index to it, click on Isteach to enter the collection/index.
Amach is to exit the program.
Treoir is for help.
• Requesting Gaelic words
When a collection has been selected, and Isteach clicked, the display changes to that shown in the next figure, which allows you to type words, among other things.
Figure 9: Requesting Gaelic words.
Before entering our own words, however, first notice that this screen allows you to go back and change to a different collection and index, by using the Athruigh button on the Cnuasacht panel. And also, that you may see which authors and books are included in the current collection and index by using the Athruigh button on the Leabharthaí panel, and you may choose to select temporarily a subset of those books. (Note that selecting a subset of a collection is not reflected in the pick-list of words, which remains that for the whole collection.)
From the Radharc panel, you may choose your display mode for the results: frequencies (Minicidheacht), a keyword-in-context concordance (KWIC) or sentences (Abairteacha). Samples of each form of results have already been shown above.
And now we come to the Focal panel, where the desired word or words may be typed into the box provided, or may be inserted there by double-clicking them on the pick list, which is a displayed segment of the collection's index. The pick list accommodates itself to the existing contents of the box, as a guide to what words are available.
If typing into the box, any accented character should be pre-composed, not a combination — e.g. type á as normal, NOT a followed by a combining acute accent.
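If you are unsure whether text you intend to paste contains combining accents, it can be checked or converted outside the program. The following Python sketch uses the standard `unicodedata` module to convert text to precomposed (NFC) form; the function name `to_precomposed` and the sample word are chosen only for the illustration, and nothing here is part of Tobar na Gaedhilge itself.

```python
import unicodedata

def to_precomposed(text):
    """Convert any base-letter + combining-accent pairs to single characters."""
    return unicodedata.normalize("NFC", text)

combining = "a\u0301thas"            # 'a' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = to_precomposed(combining)

print(len(combining), len(precomposed))   # the combining form is one character longer
print(precomposed == "\u00e1thas")        # U+00E1 is the precomposed á
```

The two strings look identical on screen, which is why a mismatch of this kind can be hard to spot when a request unexpectedly finds nothing.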
As an alternative to typing it into the box, a search word may also be chosen by double-clicking it on the pick list, or it may be pasted from the Windows clipboard (Ctrl/V). The new search word will be appended to anything already in the box (and which is not selected); if the existing material in the box is already selected — which is the default — the new search word will replace it. TIP: when working between the word-form box and the picklist, you can cycle the focus around these and the other screen items by repeatedly pressing the TAB key. While an item has focus, any content which is selected will be visibly highlighted.
In the word-form box you may put:
• a word, such as oidhreógach or saoghal or athair, as used in our previous examples
• two or more words occurring together, either consecutively or within the same sentence (eg. lúb gaoil)
• any word may contain a wild-card (*), that is, an asterisk which matches any number of letters, including none. For example, all words with a particular stem may be sought (eg. beir*), or all words with a particular termination (eg. *stin).
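The behaviour of the wild-card can be sketched in Python by translating the request into a regular expression and testing it against each entry of an index. This is an illustration of the matching rule only: the function names and the small sample index are invented, and the program's real lookup is not claimed to work this way.

```python
import re

def wildcard_to_regex(pattern):
    """Turn a request such as 'beir*' into a compiled regular expression."""
    # Escape the literal parts, then let each asterisk match any run of
    # characters, including an empty one.
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))

def matching_words(pattern, index_words):
    """Return the index entries that the whole pattern matches."""
    rx = wildcard_to_regex(pattern)
    return [w for w in index_words if rx.fullmatch(w)]

# Invented sample index.
sample_index = ["bean", "beir", "beireann", "beirim", "cistin", "fear"]

print(matching_words("beir*", sample_index))
print(matching_words("*stin", sample_index))
```

Note that beir* also matches the bare word beir, because the asterisk may match nothing at all.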
If the words box contains more than one word (ie. there is a space within it), you are asked to choose between seeking the words directly adjacent and in the given order; or within the same sentence in any order.
It is even possible to give one or more of the words as simply the asterisk (*), which matches any word; the search is then assumed to be a consecutive one. (But avoid giving * as the final word, as the search will be slow.) As we will see below (under demutation) a hyphen is, in most circumstances, counted as a separate word, so search for sean-bhean as three words: sean - bean (as well as sean bean and seanbhean to cover any unhyphenated instances).
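The two kinds of multi-word search can be sketched as follows in Python. The sentence is represented as a list of index terms (so that sean-bhean appears as sean, -, bean), and a bare asterisk in the request matches any single word. The function names are hypothetical and the sketch makes no claim about how the program itself evaluates such requests.

```python
def find_consecutive(sentence, request):
    """Positions where the request words occur adjacent and in the given order."""
    n = len(request)
    return [i for i in range(len(sentence) - n + 1)
            if all(r == "*" or r == t
                   for r, t in zip(request, sentence[i:i + n]))]

def in_same_sentence(sentence, request):
    """True if every request word occurs somewhere in the sentence, in any order."""
    return all(word in sentence for word in request)

# Invented sentence, shown as the index terms it would be stored under.
sentence = ["sean", "-", "bean", "a", "bhí", "ann"]

print(find_consecutive(sentence, ["sean", "-", "bean"]))   # hyphen given explicitly
print(find_consecutive(sentence, ["sean", "*", "bean"]))   # asterisk stands for any word
print(find_consecutive(sentence, ["bean", "sean"]))        # wrong order: no match
print(in_same_sentence(sentence, ["bean", "sean"]))        # order is irrelevant here
```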
You may tick the Gan beinn ar an tsíneadh fhada checkbox if you want to include words which differ from that requested only by the presence or absence of an accent, eg. comhradh with this checkbox ticked will match comhrádh, cómhradh and cómhrádh as well. For applicable languages, ticking this box also includes words which differ from that requested:
• by the presence of ANY accent;
• by a difference of CASE, though case differences are already removed in forms indexes other than German;
• by the intrusion of certain non-alphanumeric characters, such as £, %, period (indicating abbreviation), apostrophe (indicating elision), hyphen (indicating anonymisation), etc. For example, 2 will retrieve also £2 or 2% or 2° or 2½; but note that many Gaelic words containing apostrophes or hyphens are already treated as compounds and indexed as two or more separate parts, as explained under decompounding below.
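The accent and case part of this behaviour can be approximated in Python by reducing both the request and each index entry to a folded key. This is a sketch of the general technique under stated assumptions: the names `fold` and `accent_insensitive_matches` are invented, the sample index is made up, and the handling of intrusive characters such as £ or % is not attempted.

```python
import unicodedata

def fold(word):
    """Strip accents and case so that variants share one comparison key."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def accent_insensitive_matches(request, index_words):
    """Return index entries equal to the request once accents and case are ignored."""
    key = fold(request)
    return [w for w in index_words if fold(w) == key]

# Invented sample index.
sample_index = ["comhradh", "comhrádh", "cómhradh", "cómhrádh", "comhartha"]
print(accent_insensitive_matches("comhradh", sample_index))
```

All four spellings of the first word are retrieved, while comhartha is not, since it differs in more than its accents.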
To type accented vowels, use your normal method of doing this under Windows. For information about keying accented letters under Windows, look here, or see the section "Keyboards layouts" near the end of this file. (But you will not require the support for dotted consonants offered by these keyboards, as lenition is always indicated by suffixing the letter h in Tobar na Gaedhilge.) Your method of typing accented characters should result in precomposed characters, as most methods do, rather than in separate combining accents.
When all this is complete, you may click the OK button to produce the results.
Further hints on the selection of search words will come in the next section of this document.
• More about what to search for in a Gaelic index
Here are some pointers regarding what kinds of words are worth requesting.
When a word is requested, it is matched against a pre-compiled index of words from the chosen collection. A scrolling alphabetic listing of the current index is shown, and will indicate what words are available. For Gaelic, this index consists of words which are aggregated in a number of ways to increase coverage:
• lowercased: the words in a Gaelic index have been converted to lowercase by replacing any capital letters by small letters; this even applies to proper names. Any capital letters you include in your request will also be so converted.
• decliticised: common enclitics, such as d' in d'ól, or 's in 'seadh, or -sa in agam-sa, are treated as separate words in the index (d' + ól; 's + eadh; agam + - + sa), and should also be detached in your request. Enclitics are normally signalled in running text by a hyphen or an apostrophe. But when there is no overt signal in similar cases (eg. agamsa, seadh), the splitting in the index will have been performed manually and is unlikely to have been exhaustive.
A number of common contracted words have been indexed under their parts, e.g. 'na (from ina) under ' and a; 'na (from chun an) under 'n and a'; 'na (from chun na) under ' and na; ab (from a ba) under a and b, or under a and b'; gurab under gur and a and b; and many other similar cases. This aspect is to be made more rigorous.
• decompounded: very few words containing a hyphen have been admitted to the indexes — a list of these can be obtained by searching for *-*. Rather, most hyphenated words have been treated atomistically in the indexes, and are found by seeking their parts, including the hyphen, eg. leith-phighinn by seeking the three items leith and - and pighinn, with checking of the "consecutive" option ("Díreach i ndiaidh a chéile").
• demutated: initial mutations are removed from words in the index; so, for example, fear, fhear and bhfear are all indexed as fear, while t-olc, n-olc and holc are all indexed as olc—but, where the mutation is permanent, it is retained, e.g. chugam, thart (in one of its senses), (go) dtí. You may have noticed the benefits of demutation and decliticisation in our athair example above. An initial mutation does not leave any trace in the index; and this is also true of any hyphen which is nothing but part of an initial mutation. When typing words of Gaelic to be searched for, remember to remove initial mutations, unless they are a permanent part of the word. Removal of initial mutations may seem counter-intuitive when requesting a sequence of words (eg. ár athair), but it is nonetheless required.
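As a rough guide to preparing a request, the lowercasing, decompounding and demutation steps above can be sketched in Python. This is emphatically not the program's own algorithm: the rule tables are simplified, the `PERMANENT` set holds only the three examples named above, clitics marked by apostrophes are not handled, and the function names are invented. The rule that strips h before a vowel will also wrongly alter words that genuinely begin with h.

```python
import re

VOWELS = set("aáàeéèiíìoóòuúù")
PERMANENT = {"chugam", "thart", "dtí"}          # only the examples given above
ECLIPSIS = [("bhf", "f"), ("mb", "b"), ("gc", "c"), ("nd", "d"),
            ("ng", "g"), ("bp", "p"), ("dt", "t"), ("ts", "s")]
LENITED = ("bh", "ch", "dh", "fh", "gh", "mh", "ph", "sh", "th")

def demutate(word):
    """Remove an initial mutation from one word (simplified rules)."""
    w = word.lower()
    if w in PERMANENT:
        return w
    # h-, n- or t- prefixed to a vowel: the hyphen belongs to the mutation.
    if len(w) > 2 and w[0] in "hnt" and w[1] == "-" and w[2] in VOWELS:
        return w[2:]
    for mutated, plain in ECLIPSIS:
        if w.startswith(mutated):
            return plain + w[len(mutated):]
    if w.startswith(LENITED):
        return w[0] + w[2:]
    # Unhyphenated h before a vowel (over-applies to words like "hata").
    if len(w) > 1 and w[0] == "h" and w[1] in VOWELS:
        return w[1:]
    return w

def index_terms(request):
    """Split a typed request into the terms to search for."""
    terms = []
    for chunk in request.lower().split():
        if re.match(r"[hnt]-", chunk) and len(chunk) > 2 and chunk[2] in VOWELS:
            terms.append(demutate(chunk))
            continue
        # Any other hyphen is kept as a separate term between the parts.
        for part in re.split(r"(-)", chunk):
            if part:
                terms.append(part if part == "-" else demutate(part))
    return terms

print(index_terms("leith-phighinn"))
print(index_terms("ár n-athair"))
```

The first request becomes the three items leith, - and pighinn, and the second becomes ár athair, matching the forms recommended above.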
But the words in a Gaelic index are not lemmatized, i.e. terminally inflected forms, such as fear, fir, feara, must be searched for separately—although the wild card may often be used to advantage to retrieve the several related forms.
Finally, note that the index which will be searched is based on the words of the text which have been subjected to a limited and controlled degree of correction, as explained just below Figure 6 above.
• Searching in other languages
As well as searching the textbase for Gaelic words, indexes have been created in four additional languages, based on the words found in translations or originals in these languages of some of the Gaelic texts. As stated earlier, these translations/originals are at present confined to the Ulaidh collection, and cover only a proportion of it: Béarla (English), 2.27M tokens; Fraincis (French), 0.87M tokens; Gearmáinis (German), 0.74M tokens. There is also a tiny amount of Russian, 0.09M tokens.
Generally, such texts have been independently translated into Gaelic and into the other non-original languages; the original language in most such cases has been English, but there are examples of French (La Terre qui Meurt; Pêcheur d'Islande) and of Russian (Записки охотника). Translation of a Gaelic original is found only into Russian (extracts from Ó Neamh go h-Árainn and from Fallaing Shíoda).
Note that the Gaelic text is considered pivotal. Consequently, substantial amounts of material absent from a Gaelic text will not be included in other language versions of that text either. For example, the Gaelic translation of Turgenev's Записки охотника contains only about half the stories in the original, and only these stories will be included in other languages. Also, in the other language collections, texts are grouped according to the author or translator of the Gaelic version.
Figure 10: Search of the English index of the Ulaidh collection for the word bunch. An example is shown from Ben-Hur, and the English, Gaelic, French and German of the sentence are displayed.
Figure 11: Search of the French index of the Ulaidh collection for the word accroché. An example is shown from Iascaire Inse Tuile, and the French, Gaelic, English and German of the sentence are displayed.
Figure 12: Search of the German index of the Ulaidh collection for the word Knurren. An example is shown from Scairt an Dúthchais, and the German, Gaelic, English and French of the sentence are displayed.
Figure 12a: Search of the Russian index of the Ulaidh collection for the word лошадь. An example is shown from Scéalta Sealgaire, and the Russian, Gaelic and English of the sentence are displayed.
To search using another language, select the Ulaidh collection, and then the appropriate language. You will then have a further choice between Foirmeacha (word-forms) and Lemmata (lemma-types), because a rough and ready lemmatization has been applied to the English, French, German and Russian texts, resulting in two indexes for each of these languages. The statistics for these indexes are, at 2021/05/11:
- Ulaidh (Ulster) Index: Béarla (foirmeacha). 7 Gaelic writers; 22 books; 45,907 word-forms; 2,274,171 word-tokens
- Index: Béarla (lemmata). 7 Gaelic writers; 22 books; 35,440 lemma-forms; 2,265,810 lemma-tokens
- Index: Fraincis (foirmeacha). 3 Gaelic writers; 9 books; 34,719 word-forms; 869,640 word-tokens
- Index: Fraincis (lemmata). 3 Gaelic writers; 9 books; 16,513 lemma-forms; 867,812 lemma-tokens
- Index: Gearmáinis (foirmeacha). 4 Gaelic writers; 9 books; 49,011 word-forms; 740,950 word-tokens
- Index: Gearmáinis (lemmata). 4 Gaelic writers; 9 books; 30,069 lemma-forms; 745,422 lemma-tokens
- Index: Rúisis (foirmeacha). 3 Gaelic writers; 4 parts of books; 21,544 word-forms; 92,991 word-tokens
- Index: Rúisis (lemmata). 3 Gaelic writers; 4 parts of books; 11,350 lemma-forms; 92,908 lemma-tokens
The lemma-token counts will differ slightly from the form token-counts for the same texts. Among the reasons for this, a single word-form token may give rise to two lemma tokens, eg. the English form "cannot" gives lemmas "can" and "not"; or the German form "im" gives lemmas "in" and "die". Also, lemma indexes exclude "foreign" words.
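The effect on the counts can be seen in a small Python sketch. The lemma table here is a hypothetical stand-in containing only the examples just mentioned (the real lemmatization was done with other tools), and the function name and sample sentences are invented.

```python
# Hypothetical lemma table: one form may map to one lemma or to several.
LEMMA_TABLE = {"cannot": ["can", "not"], "men": ["man"], "im": ["in", "die"]}

def lemma_tokens(form_tokens, lemma_table, foreign=frozenset()):
    """Map word-form tokens to lemma tokens, skipping 'foreign' words."""
    lemmas = []
    for form in form_tokens:
        if form in foreign:
            continue                      # foreign words are left out of lemma indexes
        lemmas.extend(lemma_table.get(form, [form]))
    return lemmas

forms = ["the", "men", "cannot", "go"]
print(len(forms), lemma_tokens(forms, LEMMA_TABLE))

# An invented sentence with one foreign (here, Gaelic) word in an English text.
mixed = ["he", "said", "slán"]
print(len(mixed), lemma_tokens(mixed, LEMMA_TABLE, foreign={"slán"}))
```

In the first case four form tokens give five lemma tokens; in the second, three form tokens give only two. Across a whole index the two effects pull in opposite directions, which is why the lemma-token total may be either slightly higher or slightly lower than the form-token total.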
The word-form indexes for the additional languages are lowercased, even for proper names, except for German, where the initial letter of a noun (or a name) retains its case (strictly, the initial letter takes its case from the lemma we have assigned to it, which is usually the same thing; lemmatization is discussed below). Enclitics are separated (eg. English 's, 've, n't, French l', m', German 's (gibt's), 'n (so'n)). Hyphenated words are generally decompounded, eg. French garde-robe; but this policy has not been consistently applied to English, where eg. decompounded water-tight is found as well as unitary water-tight and watertight. In the German index, decompounding of (hyphenless) words, eg. weitergehen, has not been attempted.
When typing a word request into the forms index of an additional language, any uppercase letters are automatically converted to lowercase, except for the German index, where the initial letter of a word remains as typed.
Those searches used a word-forms index, but for these additional languages there are also the lemma-forms indexes. Using the lemmatized English index, a request for man will match the words man or men; while, using the French lemmatized index, a request for homme will match the words homme or hommes; or using the German lemmatized index, a request for Mann will match the words Mann, Mannes, Manne, Männer, Männern. (There is no immediate prospect of a Gaelic lemmatized index.)
Figure 13: A KWIC list of the examples of the lemma listen in Gadaidheacht le Láimh Láidir, according to our English lemmatized index. The corresponding material, in Gaelic and any other languages in which it is available, may be inspected, one example at a time, in the sentence display mode.
Figure 14: A KWIC list of the examples of the lemma abandonner (to abandon) in Ben-Hur, according to our lemmatized French index. The corresponding material, in Gaelic and any other languages in which it is available, may be inspected, one example at a time, in the sentence display mode.
Figure 15: A KWIC list of the examples of the lemma Baum (tree) in Scairt an Dúthchais, according to our lemmatized German index. The corresponding material, in Gaelic and any other languages in which it is available, may be inspected, one example at a time, in the sentence display mode.
Figure 15a: A KWIC list of the examples of the lemma тёмный (dark) in Scéalta Sealgaire, according to our lemmatized Russian index. The corresponding material, in Gaelic and any other languages in which it is available, may be inspected, one example at a time, in the sentence display mode.
It is important to understand, however, that our lemmatization of English and French and German and Russian has been performed automatically, using the Tree Tagger. This software is among the best of its kind, and lemmatization would have been impractical without it, but, as with all statistical operations, a percentage of errors is inevitable, and some remain despite much manual post-checking. The Russian indexes have further benefitted from the use of Sharoff's lemmatisation tool, and of Usachev's morphology file and of the morfer.ru site, as described here.
In making our lemmatized indexes, an initial capital has been retained in some words, mostly names (as well as for nouns in the German index); but when typing a word-request into a lemma index, no changes are made to the case typed. Therefore it will make a difference to the results whether the letters you type in your request to a lemmatized index are small letters or capitals. Keep an eye on the scrolling alphabetic list for guidance on what lemmas are available and when you should use a capital letter.
Also, foreign words are excluded from lemma indexes, and this is the main cause of the discrepancies between the counts of word-tokens and lemma-tokens in the indexes. It also makes the lemma indexes slightly more "mono-lingual" than the form indexes.
The description given above of the function of the Gan beinn ar an tsíneadh fhada (accent-insensitive) box applies equally to the other languages. The main practical effect of ticking this box is to include words differing in accents when using either a forms or a lemmas index; and to include words differing in case when using either a lemmas index or the German forms index.
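The effect of the box can be pictured as comparing folded keys instead of the raw spellings. The following is only a minimal sketch of that idea, not the program's own code: the function names and the two separate flags are assumptions made for illustration, and stripping every combining mark is cruder than a real implementation may be (it would, for instance, also fold Russian й to и).

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Drop combining marks (such as the sineadh fada) after decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)


def fold_key(word: str, ignore_accents: bool, ignore_case: bool) -> str:
    """Build the comparison key for a word under the chosen options."""
    key = word
    if ignore_accents:
        key = strip_accents(key)
    if ignore_case:
        key = key.lower()
    return key


def lookup(request, index_entries, ignore_accents=False, ignore_case=False):
    """Return the index entries whose folded key equals the folded request."""
    wanted = fold_key(request, ignore_accents, ignore_case)
    return [
        entry
        for entry in index_entries
        if fold_key(entry, ignore_accents, ignore_case) == wanted
    ]
```

In this sketch, ticking the box in a forms index would correspond to `ignore_accents=True` alone, while in a lemmas index or the German forms index it would correspond to setting both flags.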
When displaying a sentence found through a lemmatized index, the Innéacsáilte display option shows the sentence in the first language as a list of the lemmas with which it is indexed.
Many lemmas are ambiguous, e.g. in English: pack or stamp or well or lie or bound or back; or in French: pas or tendre or vague or fin. Ideally, we would like to retrieve only the desired sense of an ambiguous lemma, and, since version 1.5, we have tried to separate the senses by part of speech, using four broad categories: N (noun), V (verb), J (adjective) and Z (other). Thus, for example, a user requesting the English lemma well will be asked to choose between N, V and Z; a user requesting the French lemma vague will be asked to choose between J and N. This may help in many cases, but not in others; for the English lemma lie, for example, a more useful division would be into recline and untruth, rather than into noun and verb. Further changes in this area may be expected in future versions.
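One way to picture the part-of-speech split is as an index keyed on the pair (lemma, category) rather than on the lemma alone. The sketch below is illustrative only: the function names, the sentence identifiers and the sample data are invented, and nothing here is taken from the program's actual data format.

```python
from collections import defaultdict

# The four broad categories named in the text.
CATEGORIES = {"N": "noun", "V": "verb", "J": "adjective", "Z": "other"}


def build_index(tagged_tokens):
    """Map (lemma, category) to the list of sentence ids where it occurs.

    tagged_tokens is an iterable of (lemma, category, sentence_id) triples.
    """
    index = defaultdict(list)
    for lemma, category, sentence_id in tagged_tokens:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        index[(lemma, category)].append(sentence_id)
    return index


def categories_for(index, lemma):
    """List the categories a user would be asked to choose between."""
    return sorted(cat for (lem, cat) in index if lem == lemma)


# Hypothetical sample data, for illustration only.
sample = [
    ("well", "N", 1), ("well", "V", 2), ("well", "Z", 3), ("well", "Z", 4),
    ("vague", "J", 5), ("vague", "N", 6),
]
sample_index = build_index(sample)
```

With this sample, `categories_for(sample_index, "well")` offers N, V and Z, and `categories_for(sample_index, "vague")` offers J and N, matching the two cases described above.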
• Translation equivalents
A related innovation is the calculation of translation equivalents. Given a word in the source language, this consists of a listing of the relatively most common words in the corresponding segments of the target language. This will clearly be more effective using lemmas than using words, so it is offered only for languages with a lemmatized index (English, French, German, Russian). At present the target produced is a list of unlemmatized word-forms (Gaelic, English, French, German or Russian) — lemmas would be preferable here too, but are not yet available. This technique has potential, but is limited at present by the amount of parallel text available for each language pair (the numbers below are the numbers of Gaelic words in the shared texts):
| | English | French | German | Russian |
| Gaelic | 2,270,000 | 903,000 | 919,000 | 142,000 |
| English | | 903,000 | 919,000 | 128,000 |
| French | | | 500,000 | 45,000 |
| German | | | | 45,000 |
With the present amounts of text the results range from interesting to comical.
Where a lemmatized source index exists (languages other than Gaelic), a fourth output display mode is available, named Freagar-fhocla. The calculation may take a few moments. The resulting display is a list of target-language word-forms, each accompanied by a score, and sorted on these scores (the user may have it re-sorted alphabetically on the words themselves). The scores — which are not raw word counts — may range from 99,999,999 down to 100,000, and measure how common the target-language word-form is in the neighbourhood of the source-language lemma in comparison with the whole of the target language corpus. Any of the supported languages, including Gaelic, may be used as the target language.
Note: if a subset of texts is selected, only that subset contributes to the calculation.
Figure 15: Search for Gaelic words collocating with the English lemma child, in the Ulster texts.
The chosen source-language lemma defines a set of sentences in the source-language corpus — those sentences in which it occurs — and a corresponding set of sentences in the target-language corpus — those sentences which translate them. This "select part" of the target-language corpus is studied, looking for words (freagar-fhocla, word-equivalents) which are more frequent there than in the target-language corpus on average.
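The selection step can be sketched as follows, assuming the parallel text is available as aligned pairs of (source lemmas, target word-forms) per sentence. That representation, the function name and the toy sentences are assumptions made for illustration; the program's internal storage is not described here.

```python
from collections import Counter


def select_part_counts(aligned_sentences, source_lemma):
    """Count target word-forms in the select part and in the whole corpus.

    aligned_sentences is a list of (source_lemmas, target_words) pairs,
    one pair per aligned sentence.
    """
    select_counts = Counter()
    overall_counts = Counter()
    for source_lemmas, target_words in aligned_sentences:
        overall_counts.update(target_words)
        if source_lemma in source_lemmas:
            # This sentence translates one containing the chosen lemma.
            select_counts.update(target_words)
    return select_counts, overall_counts


# Invented toy data: English lemmas paired with Gaelic word-forms.
toy_corpus = [
    (["the", "child", "be", "asleep"], ["tá", "an", "leanbh", "ina", "chodladh"]),
    (["the", "dog", "be", "bark"], ["tá", "an", "madadh", "ag", "tafann"]),
]
select, overall = select_part_counts(toy_corpus, "child")
```

Comparing the two counters, word by word, is what allows forms such as leanbh to stand out against forms such as tá, which are common everywhere.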
If the source-language word is uncommon (that is, if it selects less than one-thousandth of the source-language corpus), a warning is issued that the results may not be statistically useful, but no impediment is placed on calculating them.
If a target-language word turns out to be equally frequent in the select part as on average, it is given a score of 100,000; if it turns out to be twice as frequent in the select part as on average, it is assigned 200,000; and so on. Words less frequent in the select part than on average are discarded as uninteresting, so that 100,000 is the minimum score among those retained. At the other end of the range, the score 99,999,999 is assigned to any word which is 100 times or more as common in the select part as on average.
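The scoring rule just described might be sketched like this. The function and parameter names are hypothetical, the linear scaling between the stated points (100,000 for equal frequency, 200,000 for double) is an assumption, and the threshold of 100 times for the top score is taken literally from the text above.

```python
from fractions import Fraction

BASE_SCORE = 100_000    # score for a word exactly as frequent as on average
MAX_SCORE = 99_999_999  # score assigned at or above the capping ratio


def equivalent_score(select_count, select_total, corpus_count, corpus_total,
                     cap_ratio=100):
    """Score a target word-form, or return None if it is discarded.

    select_count / select_total is the word's relative frequency in the
    select part; corpus_count / corpus_total is its relative frequency in
    the whole target-language corpus.
    """
    if min(select_total, corpus_count, corpus_total) <= 0:
        raise ValueError("totals and corpus_count must be positive")
    # Exact ratio of the two relative frequencies (no floating-point error).
    ratio = Fraction(select_count * corpus_total, select_total * corpus_count)
    if ratio < 1:
        return None  # less frequent than average: uninteresting
    if ratio >= cap_ratio:
        return MAX_SCORE  # threshold as stated in the text
    return round(BASE_SCORE * ratio)  # assumed linear scale
```

For instance, a word occurring 20 times in a select part of 1,000 words, and 100 times in a corpus of 10,000 words, is twice as frequent as average and would score 200,000 under these assumptions.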
Even if a word's score falls within the range 100,000 to 99,999,999, it is still omitted from the displayed list if its absolute frequency is small. This is intended to suppress "accidental" collocations, which will disappear naturally as more text is added, but may mask more significant data while they remain. A suitable empirical lower cutoff for the absolute frequency of a word-equivalent is found to be the square root of one-tenth of the frequency of the source-language word.
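As a worked illustration of this cutoff, here is a small sketch. The names are hypothetical, and whether a word lying exactly on the cutoff is kept or dropped is not stated in the text, so the choice made here (kept) is an assumption.

```python
import math


def min_absolute_frequency(source_frequency: int) -> float:
    """Cutoff: the square root of one-tenth of the source word's frequency."""
    return math.sqrt(source_frequency / 10)


def keep_equivalent(target_frequency: int, source_frequency: int) -> bool:
    """Keep a word-equivalent only if its absolute frequency reaches the cutoff.

    Assumption: a word exactly on the cutoff is kept.
    """
    return target_frequency >= min_absolute_frequency(source_frequency)
```

So for a source-language word occurring 1,000 times, the cutoff is the square root of 100, that is 10: a candidate equivalent seen 9 times would be dropped from the list, while one seen 10 times would stay.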
Results are still rather poor with the amount of text available, but will improve as the quantity increases. Even at the present time, however, it may be of interest to input English lemmas from the following list, and to compare the results with the content of existing English–Irish dictionaries, noting what is found in the dictionaries but absent from the corpus, as well as what relevant equivalents are found in the corpus but not in the dictionaries: smoke, minute, also, yet, dog, ice, bee, garden, help, interest, gravel, cave, busy, cell, kitchen, open.