The text format used in Tobar na Gaedhilge
Please note that the Gaelic texts in Tobar na Gaedhilge have been digitised to provide access to the
richness of the several varieties of continuity Gaelic. While people are free to use the texts in any way they wish,
we beg to point out that to distort them by converting them into another form of Irish for extra-linguistic reasons
would run counter to the purpose for which they have been created.
This document, started on 2018/02/25, describes the format of texts prepared for use with Tobar na Gaedhilge.
These texts comprise, besides Gaelic texts, a quantity of parallel texts in English, French, German and Russian.
Draft details of preparing and lemmatising English, French, German and Russian texts will be given in a separate document.
Here the main focus will be on Gaelic texts, mostly from Ireland but including some from Scotland, for which the treatment will vary slightly.
The compact format described here makes it easily possible to work directly with running text, even in a text editor,
while providing all the necessary richness of markup. It would be possible to write a converter from this text format to an XML-compatible
scheme, but I have not done so, because I have not found off-the-shelf XML-aware retrieval software which could reproduce the functionality
of the Tobar na Gaedhilge software acting on text in the present format. Some comments on conversion to XML are given at the end.
Part I (How to interpret marked-up text) explains how to interpret the various mark-up symbols which will be encountered in examining the texts.
Part II (How to prepare marked-up text) describes how to prepare texts in the required format, including the use of utility programs.
Note: the matrix language of a text is the main language; interpolations in other languages are referred to as foreign matter.
For example, for a Gaelic translation of a text originally written in English, the matrix language is Gaelic; foreign matter in this translation may be
in any language other than Gaelic, including English.
I. How to interpret marked-up text.
Contents
horizontal and vertical spacing; end of line (|); reference lines (<…>); types of text unit
footnotes (‰)
lenition; quotation marks; apostrophe; dashes; ellipsis; ampersand; bullet
degree sign (°); prime (′); fraction slash (⁄); metrical breve (⏑); mathematical italics; superscripts
error correction ({...}; [...])
dubious (¬); interesting (#, $)
typographic emphasis (@...@)
names (‡...‡; †...†)
foreign matter (\...\)
segmentation of tokens (+)
joining of tokens (_)
demutation (^)
broken word at end of line
if markup characters occur in text
postags and lemmas («.../....»)
transient characters
how to remove markup
Text structure
The texts are held as plain-text files, encoded in unicode, and saved in utf8 form with byte-order-mark,
and with CR/LF as line break.
Accented characters are pre-composed (such as á, U+00E1), not combinations (such as a, U+0061 followed by combining acute, U+0301).
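As a rough illustration of these file conventions, the following Python sketch checks the raw bytes of a file against them. The function name check_text_file and the wording of its messages are invented for this example and are not part of the Tobar na Gaedhilge software; the test for Unicode normalisation form NFC is used here only as an approximation of "pre-composed characters only".

```python
import unicodedata

def check_text_file(raw: bytes) -> list[str]:
    """Return a list of problems found in the raw bytes of a text file."""
    problems = []
    # utf8 with byte-order-mark: the file starts with the bytes EF BB BF.
    if not raw.startswith(b"\xef\xbb\xbf"):
        problems.append("missing byte-order-mark")
    # "utf-8-sig" decodes utf8 and drops a leading byte-order-mark if present.
    text = raw.decode("utf-8-sig")
    # After removing every CR/LF pair, no lone CR or LF should remain.
    rest = text.replace("\r\n", "")
    if "\n" in rest or "\r" in rest:
        problems.append("line break other than CR/LF")
    # Assumption: NFC stands in for "pre-composed, not combinations".
    if not unicodedata.is_normalized("NFC", text):
        problems.append("combining sequences present")
    return problems
```

An empty list means that none of the three checks failed.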
Certain typographic features are unavailable in plain-text, such as small caps, bold style, italic style, or variation of typesize.
Where considered important, they may be explicitly marked-up.
Each line of printed text forms a line of the file, and each page is preceded by an additional reference line,
giving the number of the page which follows, eg. <L 25>.
A line containing <L 0> introduces a piece of text for which no page number could be seen or inferred —
this might apply, for example, to the front matter of a book.
It might also apply to an entire text which is unpaginated, ie. not made up into pages and lines according to some printed edition.
See also footnotes, which are assigned to a separate series of pages.
At the beginning of the text, or elsewhere, other reference lines may be included, such as
• <N LU034> (giving text identification code)
• <B 1930> (giving year)
• <C Prós> or <C Fil.> (indicating switches of genre).
These are not used by existing applications, and have not been consistently included.
Reference lines are distinguished by having the less-than sign < in the first position.
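A minimal sketch of how a program might recognise a reference line is given below. It assumes, on the strength of the examples above only, that a reference line consists of a category code, one space, and a value, enclosed in < and >; the name parse_reference is hypothetical.

```python
import re

# Assumed shape, inferred from the examples: "<", category, space, value, ">".
REFERENCE = re.compile(r"^<(\S+) (.*)>\s*$")

def parse_reference(line: str):
    """Return (category, value) for a reference line, or None otherwise."""
    m = REFERENCE.match(line)
    return (m.group(1), m.group(2)) if m else None
```

For instance, parse_reference("<L 25>") gives ("L", "25"), while a line in which < occurs anywhere other than the first position gives None.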
All leading spaces on a line have been removed.
All left indentation and centering has been removed. For trailing spaces, see end-of-line marking.
A blank line is left after a paragraph. Two blank lines are left after a unit at the next higher level, three blank lines after a unit at the next higher level again, etc.
Whenever these blank lines occur at the same place in the text as a reference line, they have been placed before the reference line, not after it.
Structural levels above the paragraph are text-dependent, and applications currently make no use of them, other than to allow print-out of text by unit (Program Collate).
Consistency in marking these higher levels is being gradually addressed.
Two horizontal spaces are left between sentences.
At the end of every text line — excluding reference lines and blank lines — a vertical line | character (U+007C) is to be found, after the appropriate number of spaces.
The appropriate number of spaces will in most cases be one; but at the end of a sentence, two; at the end of a paragraph, three; and so on.
It may be zero after a hyphenated word (whether or not the hyphen is a permanent one, or just the result of breaking a word at the line-end).
At first sight, for units at the paragraph level and above, this horizontal spacing appears to duplicate the vertical spacing with blank lines.
And indeed the horizontal spacing is initially generated automatically from the vertical spacing, but horizontal spacing is subsequently adjusted, for a number of purposes.
For example, a very short sentence, which would make a poor retrieval context when displayed in isolation, may have been attached to a neighbour to make a single retrieval unit
by leaving only one space between them. Conversely, a very long sentence may have been split into several retrieval units, by embedding two spaces.
Similar adjustments of the horizontal spacing serve to align retrieval units between parallel texts in different languages.
For such parallel texts, the sentence/paragraph etc structure of the Gaelic text is reproduced in the other languages, by alterations where necessary in the horizontal spacing,
which therefore may not reflect the structure of the printed text in the other language. In summary, the blank lines continue to reflect the organisation of the text
on the printed page for every language version of a text, while the horizontal spacing now defines cross-linguistic retrieval contexts.
Note carefully that any changes made to the horizontal spacing will invalidate the alignment with other language versions of a text, where these exist.
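The way in which horizontal spacing defines retrieval units can be sketched in Python as follows. This is an illustration only, not the retrieval program itself: the name retrieval_units is invented, lines are assumed to have had their CR/LF removed already, and words broken at the end of a line are not rejoined.

```python
import re

def retrieval_units(lines: list[str]) -> list[str]:
    """Join text lines and split them wherever two or more spaces occur."""
    running = ""
    for line in lines:
        if not line.strip() or line.startswith("<"):
            continue  # skip blank lines and reference lines
        # Keep the text and its trailing spaces; drop "|" and anything after it.
        running += line.split("|", 1)[0]
    # A run of two or more spaces marks the boundary between retrieval units.
    return [unit for unit in re.split(r" {2,}", running) if unit]
```

Because only the spacing is consulted, reducing two spaces to one merges two units, and widening one space to two splits a unit, as described above.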
Codes may follow the vertical line character, to classify the units ending on that text line.
Note, however, that this information is not used by existing applications.
At any one level, we may expect to find different types of unit occurring, eg. at level one, we might have: prose sentence; line or couplet of poetry; title of level 2 unit.
At level two, we might have: prose paragraph; stanza of poetry; title of level 3 unit. For any line on which a sentence or larger unit ends, the vertical line character may be followed
by a sequence of single-letter type codes, eg.
The first sentence.  The second sentence.  The start of a third |AA
This line contains the endings of two level 1 units. Each of these endings contributes one code letter after the vertical bar, in left-to-right order.
Some texts include the following type codes, although we reiterate that existing applications make no use of them:
• level 1: A (prose sentence), L (line of poetry), T (title of level 2 unit)
• level 2: P (prose paragraph), V (stanza of poetry), T (title of level 3 unit)
• level 3: R (prose section), T (title of level 4 unit)
• level 4: H (prose "story"), T (title of level 5 unit)
As a further example:
The first sentence.  The second sentence, the last in the paragraph.   |AAP
This line contains two endings: the first sentence ends a level 1 unit, and the second ends both a level 1 unit and a level 2 unit (the paragraph). The first contributes A and the second AP.
Each ending contributes a number of codes one less than the associated number of horizontal spaces: two spaces give one code, three spaces give two.
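The allocation of type codes to unit endings can be sketched as follows. Since existing applications make no use of these codes, this is purely illustrative, and the name unit_endings is invented; the sketch relies on the rule that a run of n spaces (n at least two) takes n - 1 code letters.

```python
import re

def unit_endings(line: str) -> list[tuple[str, str]]:
    """Pair each unit ending on a line with its share of the type codes."""
    text, _, codes = line.partition("|")
    endings, start, used = [], 0, 0
    # Each run of n >= 2 spaces closes a unit and takes n - 1 code letters.
    for gap in re.finditer(r" {2,}", text):
        share = len(gap.group()) - 1
        endings.append((text[start:gap.start()], codes[used:used + share]))
        start, used = gap.end(), used + share
    return endings
```

Applied to a line whose first sentence is followed by two spaces and whose second sentence is followed by three spaces and |AAP, this pairs the first sentence with A and the second with AP.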
Footnotes; captions
A "per mille" sign ‰ (U+2030) is prefixed to the footnote identifier, both at the reference point in the text, and at the start of the footnote itself.
A footnote identifier is conventionally printed as a superscript, but we encode it at normal size. Note that a footnote identifier need not be numeric.
At the reference point in the text, the footnote identifier (including the ‰ before it: for example ‰1) is suffixed to the token, without space or other word separators intervening.
Punctuation may intervene, as may information concerning POStag and lemma.
The footnote itself is not entered with the page on which it is referenced.
Instead, footnotes are placed on a separate series of pages, with page numbers formed by adding N to the number of the page from which the note is referenced.
For example, if there is a footnote on page 25, we introduce a page <L 25N>, and place there the text of any footnotes referenced from page 25.
These footnote pages are placed at the end of the text, after the regular pages.
The system just described is used for footnotes which are fairly self-contained and not particularly short. A very short footnote, which would make little sense out of its context,
may instead be inserted into the text at the point from which it was referenced, but enclosed in round parentheses, or (if round parentheses are already in use) between dashes.
No footnote identifier is used in this case.
Pictures and diagrams are not included in our texts.
However their captions may be included, in a further series of pages, similar to those which accommodate footnotes.
If the picture or diagram occurs on page 80 (for example), or on an unnumbered page directly following page 80,
the caption is entered on a page <L 80P>, and such pages are placed at the end of the text, after any footnote pages.
Encoding of specific characters
(Gaelic texts only) Lenition is encoded by suffixing h to the lenited consonant, even when the text uses a dot-above diacritic.
Quotation marks, whether single or double, are encoded matching, left and right.
The symbols used for quotation marks depend on the language. For Gaelic and English, the normal choice would be:
double high-6 quote “ (U+201C), double high-9 quote ” (U+201D), single high-6 quote ‘ (U+2018) and single high-9 quote ’ (U+2019).
For German, a common choice would be:
double low-9 quote „ (U+201E), double high-6 quote “ (U+201C), single low-9 quote ‚ (U+201A) and single high-6 quote ‘ (U+2018).
(If the high-6 quotes are mis-displayed here as high-9-reversed, which is unacceptable for German, it is likely that the Lucida Console font
which is intended to show them in this section may not be installed on the computer and is being substituted by Courier New.)
An apostrophe ' (U+0027) indicates an elision of one or more letters, and must be distinguished from a high-9 single quote,
as they require different computational treatment, even though both may have an identical glyph in some fonts.
An apostrophe is word-material, whereas quotation marks are not.
A dash is encoded as unicode — (U+2014) and is distinct from a hyphen, which is - (U+002D).
In tokenising a text, ie. dividing it into words, a hyphen is considered to be word-material while a dash is not part of a word.
A dash is separated from adjacent words by a space on either side, though the space may be omitted if the adjacent character is punctuation.
A dash within a range of numbers, or between two words of equal status (eg. Mason–Dixon line, English–Irish dictionary), is encoded as en-dash, – (U+2013).
An en-dash is considered to be word-material, and is not flanked by spaces. This may not be consistently encoded.
Where a printed dash serves to anonymise the whole or part of a word (e.g. "Mr. H—"), one or more hyphens should be used instead in the computer file, eg. Mr. H--.
A series of two or more dots representing ellipsis is encoded as a single character … (U+2026).
This holds however many dots there are, as long as there are at least two. An ellipsis is not considered word-material, even if there is no space adjacent to it.
An ampersand is encoded as & (U+0026).
This holds also when the ampersand is part of a word, as in &c. The ampersand is word-material.
This is also the encoding for the "tironian et" found in Gaelic script; it looks a little like a figure 7. This holds also when the tironian et is part of a word, as in &rl.
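The distinction between word-material and other characters might be expressed in a tokeniser along the following lines. This is a simplified sketch which ignores all introduced markup; the names WORD and tokens are invented, and the use of \w for "letters and digits" is an assumption (it also admits the underscore).

```python
import re

# Word-material per the rules above: letters and digits, plus apostrophe
# (U+0027), hyphen (U+002D), en-dash (U+2013) and ampersand (U+0026).
# Dashes (U+2014), ellipses (U+2026) and quotation marks are left out.
WORD = re.compile(r"[\w'\-\u2013&]+")

def tokens(text: str) -> list[str]:
    """Return the word-tokens of a piece of unmarked text."""
    return WORD.findall(text)
```

Under these assumptions, a string such as ‘D'imthigh sé — &rl…’ yields the three tokens D'imthigh, sé and &rl.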
A line of symbols — often asterisks or dots (bullets) — across the page, dividing the text into sections,
is encoded as a separate paragraph consisting of exactly five bullets ••••• (U+2022).
There will be no text on the line, other than the five bullets.
There should be no space between the bullets.
The line is made a separate unit by having at least one blank line before and after it.
Miscellaneous unicode characters
There are other cases where several unicode characters have an identical or very similar appearance,
and we should clarify which unicode character we have chosen to use, and what we use it for.
Here are some choices which we have made.
• degree sign ° (U+00B0) for degrees; NOT masculine ordinal º (U+00BA) OR superscript zero ⁰ (U+2070)
• prime ′ (U+2032) for minutes (of degrees); also for stress mark (O'Growney phonetics); NOT acute accent ´ (U+00B4)
• fraction slash ⁄ (U+2044) in fractions, eg. 1⁄16; also in dates, eg. 4⁄7⁄1912; NOT solidus / (U+002F),
except between lemma and POStag.
Other non-routine characters which have been required include:
• metrical breve ⏑ (U+23D1) for showing poetic metre
• mathematical italics (U+1D434 – U+1D467) for slender consonants in O'Growney phonetics, eg. 𝑑
• right angle (U+2220) in mathematical text
• logical and (U+2227), like an inverted V, in mathematical text
• therefore sign (U+2234), in mathematical text
• superscript one ¹ (U+00B9) to demarcate, in pairs, a section of a translated text taken from an alternative translation of the text; this usually occurs when
the main translation has a gap, or when it seriously re-arranges the order of the original text
• superscript two ² (U+00B2), superscript three ³ (U+00B3), etc. for sections taken from the second, third, etc. alternative translations of a text.
Introduced markup
Text corrections
Matched curly bracket characters { (U+007B) and } (U+007D) enclose material which was not present in the printed work,
but has been inserted as a correction.
Matched square bracket characters [ (U+005B) and ] (U+005D) enclose material which was present in the printed work,
but which has been deleted as a correction.
For example, if the book contains the word "fear" which should obviously be "féar", the word may be encoded f{é}[e]ar.
Corrections made in this way ensure that the original text is always recoverable.
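The recoverability of the original can be illustrated with a short Python sketch. The two function names are invented for the example, and the sketch assumes that correction brackets are never nested.

```python
import re

def as_printed(text: str) -> str:
    """Undo corrections: drop {insertions} and restore [deletions]."""
    text = re.sub(r"\{[^{}]*\}", "", text)
    return re.sub(r"\[([^\[\]]*)\]", r"\1", text)

def as_corrected(text: str) -> str:
    """Apply corrections: keep {insertions} and drop [deletions]."""
    text = re.sub(r"\[[^\[\]]*\]", "", text)
    return re.sub(r"\{([^{}]*)\}", r"\1", text)
```

For the encoded word f{é}[e]ar, as_printed gives fear and as_corrected gives féar.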
Correction markup is used in principle in three main circumstances:
• when an error is obvious and uncontroversial
• when a spelling is at variance with the overwhelming practice in the rest of the same text
• when the manuscript is available, to signal a correction from the published text back to the manuscript; this usage has still to be applied to all but a few texts.
Our policy is not to correct the texts to any external standard, but on the contrary, to preserve any spelling which may be informative as to the pronunciation.
Interesting or doubtful words
The logical-not sign ¬ (U+00AC) is prefixed to any word-token which is suspected of needing further investigation.
This mark-up is a reminder which facilitates later checking of the computer text against the printed work, or against the manuscript, when these become available.
Ideally, this markup should be removed when the uncertainty has been resolved.
There are a number of other legacy prefix characters, which may still be encountered, but were little used, and should probably be completely removed:
• the hash character # (U+0023) may have been prefixed to a word-token which is noticed to be of special lexicographical interest
• the dollar sign $ (U+0024) may have been prefixed to a word-token which is seen to be a borrowing which is orthographically unassimilated.
Typographically-marked text
We have already mentioned that typographic devices, such as small caps, bold, italic, or enlargement, cannot be captured directly in plain-text.
Where preserving such information is important, mark-up may be introduced.
In practice, such typographic devices serve a wide variety of purposes from one text to another,
and the same purposes are also served on occasion by use of quotation marks or by capitalization — features which we DO preserve.
The following rules, based on function rather than form, are unlikely to have been consistently applied.
At-signs @ (U+0040), in pairs, enclose text which is typographically emphasised (generally by capitalisation or boldface).
Matched quotation marks, double or single as context demands, are introduced to enclose text which represents the title of an external object -
e.g. a public house, a ship, a work of literature (often printed in italics).
If the quotation marks are not already present, they are introduced between curly brackets: {“} … {”}
The opening quotation mark should come before any initial mutation.
Proper names (Gaelic texts only)
A double dagger ‡ (U+2021) is placed before the start of the name (whether native or foreign), and another after the end of the name.
For example: ‡Rann na Feirste‡, ‡Seáinín an Dálaigh‡, go ‡h-Éirinn‡.
Notice that the marker is placed at the start of the word, before any mutation.
A dagger † (U+2020) is placed at the start and end of anything found within a name but which is not part of the name.
An example of such a discontinuous name is ‡Beanna† gorm-cheódhacha †Boirche‡.
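As an illustration of how this markup could be consumed, the following sketch collects the names marked on a line, omitting any material enclosed in single daggers. The function name names is invented, and mutation prefixes are simply left as they stand.

```python
import re

def names(text: str) -> list[str]:
    """Return the proper names marked with double daggers in a line."""
    found = []
    for m in re.finditer(r"\u2021([^\u2021]*)\u2021", text):
        # Material between single daggers is not part of the name.
        name = re.sub(r"\u2020[^\u2020]*\u2020", " ", m.group(1))
        # Tidy the spacing left behind by the removal.
        found.append(" ".join(name.split()))
    return found
```

The discontinuous example above yields the single name Beanna Boirche.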
Many items with initial capitals are not marked as names, e.g. Dia, Aifreann, Samhradh, Nodlaig, an Cogadh Mór, Turas na Croiche, Trí Rann agus Amhrán.
We mark names of places, but not nouns or adjectives derived from them (such as "Éireannach", "Conallach", "Frainncis").
This may change! We exclude individual buildings from name marking, eg. Arc de Triomphe.
An article at the margin of a name is not regarded as part of the name, e.g. an ‡Clochán Liath‡.
If the article is internal to the name, we have no choice but to include it, e.g. ‡Rann na Feirste‡.
We exclude a title like An Tighearna, An Bhean Uasal, Príor, Lord, Sir, Mr., Mrs. as a name component.
This may change! We may exclude a "loose" generic element from the name marking, e.g. Cuan ‡Dhún na nGall‡, Báighe ‡Bheanntraighe‡, ‡Johnny‡ Beag.
Components of a name are not excluded from lexical processing, since a name may contain productive lexical items, or may even be entirely composed of such items, eg. An tAmadán Mór, An Fál Carrach.
We do, however, exclude foreign name elements in Gaelic texts from certain processing, by marking them as foreign.
Foreign names in non-Gaelic texts have not, in general, yet been marked as foreign.
Foreign words
Backslashes \ (U+005C), in pairs, enclose material in languages other than the matrix language of the text (foreign matter).
There is little difficulty in marking foreign matter extending to a whole sentence or more. Single foreign words and short fixed phrases embedded in matrix text
may still be marked foreign even where there is some degree of orthographic assimilation (eg. Hóm Rúl), and — in Gaelic matrix text — where they are subject to initial mutation
(e.g. gnoithe an phledge).
Foreign names are marked as foreign (as well as being marked as names), apart from the most common and thoroughly assimilated cases. (This policy has as yet been applied exhaustively
only to the Ulaidh collection of Gaelic texts.) Foreign names greatly outnumber other foreign items in most texts, and especially in translations.
Nouns or adjectives derived from names marked foreign are themselves marked foreign also, eg. Maraîchine, Poitevin, Vendéan, an t-Idumaen, Coirinteach.
Segmentation of compound words
The plus character + (U+002B) is used to split a compound word into multiple tokens.
For example:
mór+luachach, an+íde, agam+sa
mór+-+luachach, an+-+íde, agam+-+sa
d'+ól, b'+éigin, m'+athair, do+'n, 's+é, '+brath
do+n, s+é, s+a, s+an
a+tá, a+deir, a+dubhairt, a+tchí
a+b, gur+a+b, gur+bh, nár+bh
Joining of tokens
The underscore character _ (U+005F) was formerly used to join two tokens into one.
For example:
dá_ríribh (x3 in AU022, not used in other texts, in which there are many potential examples)
cionn_is (x12 in AU022)
a_dh' (x3 in AU022, some in AU012 may have been removed)
This practice is discontinued, and underscores are being replaced in texts by spaces; a better solution is to be sought.
Initial mutations (Gaelic texts only)
The hat character "^" (U+005E) is inserted before any character which should be ignored in tokenisation.
The major use of the hat in Gaelic text is to mark characters which are present as a result of initial mutation.
For example: f^hear, ^b^hfear, ^t^-athair, ^n^-athair, ^hathair
When tokenising the text, the hat and the character immediately following it are dropped.
When displaying text, the hat is dropped but the following character is retained.
Non-initial lenition is not marked-up.
A mutation after a marked-up segmentation is regarded as initial and is marked up, eg. sean+-+c^hró.
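The two treatments of the hat, together with segmentation at the plus character, can be sketched as below. The function names are invented, and the sketch deals with a single word at a time.

```python
import re

def for_tokenising(word: str) -> str:
    """Drop each hat together with the character that follows it."""
    return re.sub(r"\^.", "", word)

def for_display(word: str) -> str:
    """Drop each hat but keep the character that follows it."""
    return word.replace("^", "")

def split_tokens(word: str) -> list[str]:
    """Split a compound at each plus sign, then strip hatted characters."""
    return [for_tokenising(part) for part in word.split("+")]
```

Thus ^b^hfear is tokenised as fear but displayed as bhfear, and sean+-+c^hró splits into the tokens sean, - and cró.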
Broken words at end-of-line
The hat character may also be used with the hyphen in a word broken at end-of-line.
A transient hyphen should be dropped in rejoining the broken word, and this is achieved by prefixing the hyphen with a hat character.
There is however one way in which the hat is not optimal for this case — when displaying text, ^- at end of line should display nothing,
especially if the original line-structure is not being reproduced; whereas ^- will display "-", in line with the use of hat before other characters.
Characters employed as markup
The characters used as markup have a low probability of appearing as themselves in text, but nevertheless this can happen.
Rather than revise the markup scheme and all existing texts when this happens, we replace the newly-appeared character in the text
by one which has not yet occurred.
The following list of characters requiring recoding because they are used as markup is complete at time of writing:
A left square bracket (U+005B) in text is represented by left square bracket with quill ⁅ (U+2045).
A right square bracket (U+005D) in text is represented by right square bracket with quill ⁆ (U+2046).
A plus sign (U+002B) in text is represented by plus with dot over ∔ (U+2214).
The above recodings should not be visible to an end-user. Applications, such as our retrieval application, should make the necessary changes,
both to the user's request and to the displayed results, so that it appears to the user that they are always dealing with the familiar character.
The plain-text files, of course, contain the recoded characters.
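An application might hide the recoding from the user with a pair of translation tables, as in this sketch. The names encode_query and decode_result are hypothetical, and the tables cover only the three recodings listed above.

```python
# Text characters that clash with markup, and their stand-ins in the files.
TO_FILE = str.maketrans({"[": "\u2045", "]": "\u2046", "+": "\u2214"})
TO_USER = str.maketrans({"\u2045": "[", "\u2046": "]", "\u2214": "+"})

def encode_query(query: str) -> str:
    """Recode a user's search string to match what is stored in the files."""
    return query.translate(TO_FILE)

def decode_result(text: str) -> str:
    """Show the familiar characters again in displayed results."""
    return text.translate(TO_USER)
```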
Although the symbols < and > are used in marking-up reference lines,
they may still be freely used in text, provided that < does not occur in the first position on a line.
If the symbols dagger † or double dagger ‡ occur in text as footnote references, they are replaced by
** and *** respectively.
POStags and lemmas
For texts in languages other than Gaelic, a word in the matrix language of the text may have supplementary information appended,
consisting of a lemma and a part-of-speech tag (POStag).
This information has been generated using the TreeTagger, with manual post-editing.
For words which are foreign to the matrix language of the text, no supplementary information is recorded.
Corresponding enhancement of the Gaelic texts is still under development.
Left double angle quotation mark « (U+00AB) and right double angle quotation mark » (U+00BB) are used to enclose supplementary information about a token.
The supplementary information normally takes the form of a lemma and a POStag, separated by a solidus / (U+002F), for example: men«man/NN»
The supplementary information is attached to the end of the token, without space or other word-separator intervening; and before any footnote reference attached to the token.
In Gaelic texts, this supplementary information has been added manually and as yet only sporadically,
and the content of the double angle brackets is normally simply a number, referencing a numbered list of the senses of the wordtype.
When this markup is more complete, it is intended to transform it into the «lemma/POStag» form used for other languages.
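The «lemma/POStag» annotation could be read back from a line as sketched below. The name annotations is invented; entries without a solidus, such as the sense numbers in Gaelic texts, are skipped, and no attempt is made to detach punctuation from the token.

```python
import re

# A token (no spaces or angle quotes) followed directly by «...».
ANNOTATED = re.compile(r"([^\s\u00ab\u00bb]+)\u00ab([^\u00bb]*)\u00bb")

def annotations(text: str) -> list[tuple[str, str, str]]:
    """Return (token, lemma, postag) for each annotated token in a line."""
    out = []
    for token, info in ANNOTATED.findall(text):
        # Split at the last solidus; skip entries which have none.
        lemma, sep, postag = info.rpartition("/")
        if sep:
            out.append((token, lemma, postag))
    return out
```

For the example above, men«man/NN» is returned as the triple ("men", "man", "NN").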
Transient characters
We may also note that several characters, which are not expected to occur in text or markup, play a role in processing by the retrieval program:
• the currency sign ¤ (U+00A4) may appear in the retrieval program search box,
where it stands for a Gaelic word which has been completely elided in the written text
• the feminine ordinal indicator ª (U+00AA) and the masculine ordinal indicator º (U+00BA) may be used in intermediate processing
of sentence contexts, where they stand for highlighting-on and highlighting-off respectively.
Removing the markup
Mark-up is added to text to widen the range of possible automated processing, given software which can recognise and act upon the markup.
However if it is desired to remove the markup in order to recover the natural reading appearance of the text, the following global substitutions may be made:
• remove the characters ^ + @ ‡ † \
• remove the character | and anything between it and end of line
• remove the paired characters « and » and anything between them.
• replace left square bracket with quill ⁅ (U+2045) by left square bracket (U+005B)
• replace right square bracket with quill ⁆ (U+2046) by right square bracket (U+005D)
• replace plus with dot over ∔ (U+2214) by plus sign (U+002B)
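The substitutions listed above can be applied to a text line as in the following sketch. The name strip_markup is invented. The order matters: the plus character is removed before the recoded plus is restored. Markup not mentioned in the list, such as correction brackets, is left untouched, as are any spaces that preceded the vertical line.

```python
import re

# Recoded characters and the familiar characters they stand for.
RESTORE = str.maketrans({"\u2045": "[", "\u2046": "]", "\u2214": "+"})

def strip_markup(line: str) -> str:
    """Apply the listed substitutions to one text line."""
    line = re.sub(r"\|.*$", "", line)                    # "|" to end of line
    line = re.sub(r"\u00ab[^\u00bb]*\u00bb", "", line)   # paired « and »
    line = re.sub(r"[\^+@\u2021\u2020\\]", "", line)     # ^ + @ ‡ † \
    # Only now restore the recoded characters, so that a restored "+" survives.
    return line.translate(RESTORE)
```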
II. How to prepare marked-up text.
Note: for anyone wishing to contribute a text, it will suffice to observe the rules regarding page and line structure in the short General section below.
The further rules which follow can be applied centrally.
Here we describe the full process of preparing a text, including the use of utility programs, which are available for download.
Some texts may be obtained already digitised, particularly those in languages other than Gaelic.
Such texts will require transformation, using editor programs.
Among other considerations, texts received digitised may be unpaginated. We may paginate them, i.e. restore the page and line structure of some edition,
but we may decide not to do this, either because of the labour involved, or because the preferred edition is unavailable to us.
Unpaginated texts are fully usable, except that material retrieved from them cannot have page and line number displayed.
II.A Digitisation
Initial text layout
Texts may be digitised by keyboarding, by scanning with OCR, or they may be obtained already in digital form.
Scanned or pre-digitised text must be edited into the form about to be described.
With keyboarding many aspects of this form may be implemented at the same time as the text is being created.
General
Text is held in a plain-text file, using unicode encoding, and saved in utf8 form with byte-order-mark.
End-of-line should be indicated by CR/LF.
For accented characters, use only pre-composed (such as á, U+00E1), not combinations (such as a, U+0061 followed by combining acute, U+0301).
The page and line structure of the printed text is retained — one line of the file for one line of the book.
A line is added before each page, giving the page number, e.g.
<L 25>
at the start of page 25. A page number need not be an arabic numeral.
A reference line is distinguished by having the less-than sign < in the first position.
Other categories of reference line may be inserted, at the beginning of the text, or elsewhere, such as
• <N LU034> (giving text identification code)
• <B 1930> (giving year)
• <C Prós> or <C Fil.> (indicating switches of genre).
Reference categories other than L (for page number) are not used by existing applications.
Lines are not numbered explicitly, but are counted automatically.
No leading space should occur on a line. Any indentation or centering is not captured.
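The automatic counting of lines might work along the following lines. This is a sketch under stated assumptions, not the actual program: the name locate is invented, the line count is assumed to restart on each page, and blank lines are assumed not to be counted.

```python
def locate(lines: list[str]) -> list[tuple]:
    """Pair each text line with its page label and line number."""
    page, count, located = None, 0, []
    for line in lines:
        if line.startswith("<"):
            if line.startswith("<L "):
                # A new page: remember its label and restart the count.
                page, count = line[3:line.index(">")], 0
            continue  # other reference categories are ignored here
        if not line.strip():
            continue  # assumption: blank lines are not counted
        count += 1
        located.append((page, count, line))
    return located
```

Text lines which occur before any page reference line are given the page label None.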
Sentences and paragraphs
A text is viewed as a hierarchy of structural units. In prose, the lowest three levels of units are often sentence, paragraph and chapter.
Still higher levels, and even the level of prose chapter (level 3), depend on the text, and are not used by applications at present.
At this stage in text preparation, it is important to ensure only that:
• units at level 1 (eg. sentences) are separated by two horizontal spaces.
• units at level 2 (eg. paragraphs) are separated by one blank line; units at level 3 by two blank lines, etc.
When blank lines co-occur with reference lines, the blank lines should be placed before the reference lines.
When a level 1 unit (eg. sentence) ends at the end of a line, which is not the end of a paragraph, a vertical bar character | (U+007C) is added to the line after the requisite two spaces.
For information: at a later stage in text preparation, a program will be used to add the vertical bar character, after the appropriate number of spaces,
to every text line which does not already have it (but not to reference lines or blank lines).
And then further manual adjustment of the horizontal spacing will permit us to amalgamate or split sentences,
so as to (1) produce more useful retrieval contexts; and (2) align parallel texts in different languages.
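The program which adds the vertical bar is not reproduced here, but its behaviour might be approximated as follows. The name add_end_markers is invented, and several assumptions are made: a text line followed by no blank line is taken to end in mid-sentence (one space); a line followed by k blank lines receives k + 2 spaces; and lines ending in a hyphen receive no special treatment.

```python
def add_end_markers(lines: list[str]) -> list[str]:
    """Append "|" after the appropriate spacing to text lines lacking it."""
    out = []
    for i, line in enumerate(lines):
        is_text = bool(line.strip()) and not line.startswith("<")
        if not is_text or "|" in line:
            out.append(line)  # reference, blank, or already marked
            continue
        # Count the blank lines which immediately follow this line.
        blanks = 0
        for later in lines[i + 1:]:
            if later.strip():
                break
            blanks += 1
        # No blank line: one space. One blank line ends a paragraph
        # (three spaces); each further blank line adds one more space.
        spaces = 1 if blanks == 0 else blanks + 2
        out.append(line.rstrip() + " " * spaces + "|")
    return out
```

Lines which already carry a vertical bar, such as those marked by hand at a sentence end, are passed through unchanged.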
In fact, we allow further information on a line after the vertical bar character, but no use is currently made of such information.
Some texts utilise this possibility, which could be used by a potential application to distinguish different types of unit at the same level,
eg. at level one, we might have a prose sentence; a line or couplet of poetry; or a title of level 2 unit.
The following single-letter "type codes" are suggested, although we reiterate that existing applications make no use of them:
• level 1: A (prose sentence), L (line of poetry), T (title of level 2 unit)
• level 2: P (prose paragraph), V (stanza of poetry), T (title of level 3 unit)
• level 3: R (prose section), T (title of level 4 unit)
• level 4: H (prose "story"), T (title of level 5 unit)
This information may be captured, for any line which contains an end of unit at level one or higher, by adding a sequence of these "type codes" after the vertical line
to show the nature of the units ending on that line, in left-to-right order, eg.
The first sentence.  The second sentence, ending a paragraph.   |AAP
This line contains the endings of a level 1 unit and a level 2 unit. The first contributes one code letter after the vertical bar, and the second contributes
two code letters.
Sections of text are sometimes found separated by a horizontal line, often made up of a sequence of asterisks or bullets.
Such a line is encoded as a sequence of exactly five bullet characters ••••• (U+2022).
There should be no space between the bullets.
The line of bullets is preceded and followed by the required number of blank lines to make it appear as a separate unit of the same level as the units it separates
(normally one blank line before and after, to make it appear as a separate paragraph).
There will be no text on the line, other than the five bullets, and a vertical line character.
Footnotes
A footnote reference in the text — usually but not necessarily a number — is conventionally printed as a superscript. It occurs both at the reference point in the text
(where it is normally attached to the end of a word), and at the start of the footnote itself.
In our texts, the footnote reference should be at normal size (not superscript), and should be prefixed, in both places, by a per-mille character ‰ (U+2030).
In the text, no space should be left between the reference and the word to which it is attached.
Punctuation may intervene, however, between the word and the reference, as may information concerning POStag and lemma.
The content of a footnote (beginning with the per mille character and the reference) is not included in the page on which it is referenced.
Instead we introduce a page with N added to the page number, and place the footnote text there.
For example, if there is a footnote on page 25, we introduce a page <L 25N>, and place there the footnote from page 25.
These N-pages are placed at the end of the text, after the regular pages. The result of this treatment is similar to end-notes.
Footnotes added in translation are generally dropped, as being confined to one version of a parallel text.
The above system is used for footnotes which are fairly self-contained and not particularly short. A very short footnote, which would make little sense out of its context,
may instead be inserted into the text at the point from which it was referenced, but enclosed in round parentheses, or (if round parentheses are already in use) between dashes.
No footnote identifier is used in this case.
Captions
Pictures and diagrams from the printed work are not included in our computerised text.
However their captions may be included, in a further series of pages similar to those which accommodate footnotes.
If the picture or diagram occurs on page 80 (for example), or on an unnumbered page following page 80,
the caption is entered on a page <L 80P>, and such pages are placed at the end of the text, after any footnote pages.
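The resulting order of pages (regular pages, then footnote pages, then caption pages) can be sketched in Python as a sort key. This is only an illustration: it assumes reference lines of the simple shape <L 25>, <L 25N> and <L 80P> shown above, and the names used are hypothetical.

```python
import re

# Assumption: a page reference line is "<L n>", "<L nN>" or "<L nP>".
_REF = re.compile(r"^<L (\d+)([NP]?)>$")
# Regular pages come first, then footnote (N) pages, then caption (P) pages.
_SERIES = {"": 0, "N": 1, "P": 2}

def page_sort_key(ref_line):
    match = _REF.match(ref_line.strip())
    if match is None:
        raise ValueError(f"not a page reference line: {ref_line!r}")
    number, suffix = match.groups()
    return (_SERIES[suffix], int(number))

refs = ["<L 80P>", "<L 25N>", "<L 80>", "<L 25>"]
ordered = sorted(refs, key=page_sort_key)
print(ordered)
```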
Encoding of specific characters
(Gaelic texts only) Where the text indicates lenition by a dot-above diacritic, instead we use a digraph by suffixation of h.
There would be considerable advantage in using the diacritic throughout, whether the original text used the diacritic or the digraph.
In particular, it would avoid confusion between a lenited Gaelic word and a homograph in another language, e.g. chat — with h, these have to be guarded against.
The diacritic would also result in a more logical alphabetic ordering (as in Dinneen's dictionary).
A systematic change-over to the diacritic is a future possibility, but not contemplated soon.
Note that the choice between diacritic and digraph is independent of the choice between Latin and Gaelic fonts.
Note that encoding text using the diacritic would not preclude displaying that text using the digraph, but the reverse is not true,
due to the greater ambiguity of the digraph.
Quotation marks should be matching, left and right.
The symbols used for quotation marks depend on the language. For Gaelic and English, the normal choice would be:
double high-6 quote “ (U+201C), double high-9 quote ” (U+201D), single high-6 quote ‘ (U+2018) and single high-9 quote ’ (U+2019).
For German, a common choice would be:
double low-9 quote „ (U+201E), double high-6 quote “ (U+201C), single low-9 quote ‚ (U+201A) and single high-6 quote ‘ (U+2018).
An apostrophe ' (U+0027) indicates an elision of one or more letters, and must be distinguished from a single high-9 quote ’ (U+2019)
as they require different computational treatment, even though both may have an identical glyph in some fonts.
An apostrophe is word-material, whereas quotation marks are not.
A dash is encoded as unicode — (U+2014) and is distinct from a hyphen, which is - (U+002D).
In tokenising a text, ie. dividing it into words, a hyphen is considered to be word-material while a dash is not part of a word.
A dash is separated from adjacent words by a space on either side, though the space may be omitted if the adjacent character is punctuation.
A dash within a range of numbers, or between two words of equal status (eg. Mason–Dixon line, English–Irish dictionary), is encoded as en-dash, – (U+2013).
An en-dash is considered to be word-material, and is not flanked by spaces. This may not be consistently encoded.
Where a printed dash serves to anonymise the whole or part of a word (e.g. "Mr. H—"), one or more hyphens should be used instead in the computer file, eg. Mr. H--.
A series of two or more dots representing ellipsis is encoded as a single character … (U+2026).
This holds however many dots there are, as long as there are at least two. An ellipsis is not considered word-material, even if there is no space adjacent to it.
An ampersand is encoded as & (U+0026).
This holds also when the ampersand is part of a word, as in &c. The ampersand is word-material.
This is also the encoding for the "tironian et" found in Gaelic script; it looks a little like a figure 7. This holds also when the tironian et is part of a word, as in &rl.
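The distinction between word-material and other characters can be illustrated by a small Python tokeniser. This is a sketch under stated assumptions, not the tokeniser used by the Tobar na Gaedhilge software: it treats letters, digits, apostrophe, hyphen, en-dash and ampersand as word-material and everything else as a separator, and it takes no account of markup characters such as + or ^.

```python
import re

# Word-material per the rules above: letters and digits, apostrophe U+0027,
# hyphen U+002D, en-dash U+2013 and ampersand U+0026.
# Not word-material: dash U+2014, ellipsis U+2026, quotation marks.
_WORD = re.compile(r"[\w'\-\u2013&]+")

def tokenise(text):
    """Return the runs of word-material found in text."""
    return _WORD.findall(text)

sample = "\u201cB'fh\u00e9idir\u201d \u2014 sean-teach \u2026 1914\u20131918, &rl."
tokens = tokenise(sample)
print(tokens)
```

Because the apostrophe is word-material and the single high-9 quote is not, a word ending in an elision keeps its apostrophe while a closing quotation mark is dropped.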
When a markup character occurs in text
The characters used as markup have a low probability of appearing as themselves in text, but nevertheless this can happen. When it does, we replace the character in text
by another.
The following list of characters requiring recoding because they are used as markup is complete at time of writing:
A left square bracket (U+005B) in text is represented by left square bracket with quill ⁅ (U+2045).
A right square bracket (U+005D) in text is represented by right square bracket with quill ⁆ (U+2046).
A plus sign (U+002B) in text is represented by plus with dot over ∔ (U+2214).
These recodings should not be visible to an end-user. Applications, such as our retrieval application, should make the necessary changes,
both to the user's request and to the displayed results, so that it appears to the user that they are always dealing with the familiar character.
The plain-text files, of course, contain the recoded characters.
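The recoding in both directions can be sketched in Python with translation tables. The function names are hypothetical; an application would apply the first to literal text such as a user's search request (never to text that already carries markup), and the second to results before display.

```python
# Literal characters that clash with markup, and their stand-ins in the files.
TO_FILE = str.maketrans({"[": "\u2045", "]": "\u2046", "+": "\u2214"})
TO_USER = str.maketrans({"\u2045": "[", "\u2046": "]", "\u2214": "+"})

def encode_literal(text):
    """Recode literal [ ] + as stored in the plain-text files."""
    return text.translate(TO_FILE)

def decode_literal(text):
    """Restore the familiar characters for display to the user."""
    return text.translate(TO_USER)

stored = encode_literal("a [b] + c")
print(decode_literal(stored))
```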
Although the symbols < and > are used in marking-up reference lines, they may still be freely used in text, provided that <
does not occur in the first position on a line.
Possible errors in text
If something is noticed in text which requires further investigation, a logical-not sign, ¬ (U+00AC) may be dropped into the text as a reminder,
so that the problem can be readily located and dealt with later. An important use of this markup is to identify tokens whose form could
benefit from checking against the manuscript (where one survives).
Only uncontroversial errors are identified as such, and are then corrected reversibly, so that the book text can be recovered.
Corrections use curly brackets { (U+007B) and } (U+007D) around an insertion, and square brackets [ (U+005B) and ] (U+005D) around a deletion.
For a replacement, use an insertion followed by a deletion.
For example, "fear" when it should clearly be "féar" is encoded f{é}[e]ar.
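Because the correction is reversible, both readings can be recovered mechanically. The following Python sketch illustrates this, assuming that correction brackets are never nested; the function names are hypothetical.

```python
import re

_INSERTION = re.compile(r"\{([^{}]*)\}")   # {inserted material}
_DELETION = re.compile(r"\[([^\[\]]*)\]")  # [deleted material]

def corrected_text(marked):
    """Keep insertions and drop deletions: the text as corrected."""
    return _DELETION.sub("", _INSERTION.sub(r"\1", marked))

def book_text(marked):
    """Drop insertions and keep deletions: the text as printed."""
    return _DELETION.sub(r"\1", _INSERTION.sub("", marked))

print(corrected_text("f{é}[e]ar"), book_text("f{é}[e]ar"))
```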
Apart from clear errors, correction markup is sometimes used on spellings which are out of step with the overwhelming practice in the rest of the text.
But we do not, at this stage, try to normalise text, whether to an internal or external standard. Any spelling which is phonetically acceptable,
such as use of "a'" for "an" or "ag" is not altered, as such spellings may give valuable information about the spoken language.
(On the other hand, common spelling variation in a word may imply that there is no difference of sound.)
When the manuscript of a text is available, we may use this same correction marking to signal a change from the published text back to the manuscript text.
This use of the marking has been applied to very few texts, and far from exhaustively to those. This is a usage which requires to be much more thoroughly applied,
and it might be preferable to use a separate markup for this purpose.
Typographic features: small caps, bold face, italics, change of size
Plain-text cannot accommodate typographic features such as small caps, boldface, italics or change of point size.
Such features may be ignored, except when it is considered that their function is important enough to justify capturing them using markup.
In fact these typographic devices serve a wide variety of functions from one text to another — emphasis, foreign language, the name of a work of literature,
the name of a ship or a public house, a label or caption on a machine —
and the same functions are also served on occasion by use of quotation marks or by capitalization, which are features that we DO preserve literally.
The following encoding rules are unlikely to have been consistently applied.
Text which is typographically emphasised, generally by capitalisation or boldface, is placed between at-signs @ (U+0040).
It is recommended that at-signs, when required, should be placed outside any other applicable markup (such as foreign markup or name markup).
Text which represents the title of an external object - e.g. a public house, a ship, a work of literature - often printed in italics,
may be placed between quotation marks, double or single as context demands.
If the quotation marks are not already present, they are introduced between curly brackets: {“} … {”}
The opening quotation mark should come before any initial mutation.
If the typography simply marks the use of foreign language, no mark-up is added other than the foreign-language mark-up.
Names
The purpose of name marking is to support extraction of a list of proper names from a text (Program Ainmneacha). All names, whether native or foreign,
are marked.
A double dagger ‡ (U+2021) is placed before the start of the name, and another after the end of the name.
For example: ‡Rann na Feirste‡, ‡Seáinín an Dálaigh‡, go ‡h-Éirinn‡.
Notice that the marker is placed at the start of the word, before any mutation.
A dagger † (U+2020) is placed at the start and end of anything found within a name but which is not part of the name.
For example, ‡Seán† gorm-shúileach †Ó Briain‡.
Compare with ‡Seán Óg Ó Briain‡, where Óg is part of the proper name.
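A rough Python sketch of name extraction from this marking follows. It is not Program Ainmneacha itself: the function name is hypothetical, no attempt is made to undo mutation or inflection, and any other markup inside the name is left untouched.

```python
import re

_NAME = re.compile("\u2021([^\u2021]*)\u2021")    # double dagger ... double dagger
_NON_NAME = re.compile("\u2020[^\u2020]*\u2020")  # dagger ... dagger

def extract_names(text):
    """Return the marked names, leaving out dagger-enclosed material."""
    names = []
    for raw in _NAME.findall(text):
        kept = _NON_NAME.sub(" ", raw)
        names.append(" ".join(kept.split()))  # tidy the spacing
    return names

print(extract_names("\u2021Seán\u2020 gorm-shúileach \u2020Ó Briain\u2021"))
```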
We mark all names of persons and places, but not their derivatives, such as Críostaí, Éireannach, Conallach, Frainncis.
There are difficulties in deciding what is a name.
Many items with initial capitals are not marked as names, e.g. Dia, Aifreann, Samhradh, Nodlaig, an Cogadh Mór, Turas na Croiche, Trí Rann agus Amhrán.
A few words are sometimes used as a name and at other times are generalised as a common noun, eg. Pharaoh(aí), Ptolemí, Caesar. All occurences of such words are marked as names,
regardless of their nature in a specific case.
Some words are homonymous between a name and a non-name sense: Iúdás as a common noun can mean a traitor, or a Jew; Múrach can mean a Moor, or a person named Moore;
Eabhrach can mean a Hebrew, or be a variant of Eabhraic (York). Only examples actually used as names are so marked.
We exclude a title like An Tighearna, An Bhean Uasal, Lord, Sir, Mr., Mrs. as a name component.
This may change! We exclude a loosely-coupled generic as a name component, eg. Cuan ‡Dhún na nGall‡, Báighe ‡Bheanntraighe‡, but not ‡Johnny Beag‡.
An article at the margin of a name is not regarded as part of the name, e.g. an ‡Clochán Liath‡.
If the article is internal to the name, we have no choice but to include it, e.g. ‡Rann na Feirste‡.
This may change! We exclude individual buildings from name marking, eg. Arc de Triomphe.
When extracting multi-element names from Gaelic text, it is clear that a mutation on the initial element of the name should be undone.
But there is no simple way of dealing with initial mutations on non-initial elements of a name, or with terminal inflections of any element.
For example, from a ‡Sheáin Uí Dhomhnaill‡ we would ideally extract "Seán Ó Domhnaill" but "Seáin Uí Dhomhnaill" seems to be the best that can be done automatically.
Or, from teach ‡Mháire Báine‡ or do ‡Mháire Bháin‡, we would ideally extract "Máire Bhán" but automatically
we may have to settle for "Máire Báine" and "Máire Bháin" respectively.
Components of a name — unless also marked foreign — are not excluded from lexical processing, since a name may contain productive lexical items (eg. Dún Pádraig, Johnny Beag),
or may even be entirely composed of such items, eg. An tAmadán Mór, An Fál Carrach.
The name Iúdás may be used as a noun, meaning an untrustworthy person; and it may also be used as a general word for a Jew.
These non-literal extensions of usage are not marked as names.
Foreign language
Material which is not in the matrix language of the text (foreign matter) should be placed within paired backslash characters \ (U+005C).
This marking enables such material to be excluded from selected processing, such as segmentation/demutation, lemmatisation or lemmatised index construction, which follows
the rules for the matrix language.
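As a sketch of how a program might set foreign matter aside before matrix-language processing, the following Python function separates the two. It assumes that backslashes always occur in pairs and are never nested; the function name is hypothetical.

```python
import re

_FOREIGN = re.compile(r"\\([^\\]*)\\")  # \foreign matter\

def split_foreign(text):
    """Return (matrix text with foreign spans removed, list of foreign spans)."""
    foreign = _FOREIGN.findall(text)
    matrix = " ".join(_FOREIGN.sub(" ", text).split())
    return matrix, foreign

# Note the doubled backslashes needed inside a Python string literal.
matrix, foreign = split_foreign("iomlán na \\Hanóibhéarach\\")
print(matrix, foreign)
```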
There is little difficulty in treating foreign matter extending to a whole sentence or more. However, single foreign words and short fixed phrases embedded in matrix text present some problems.
Foreign matter which is not fully assimilated to the orthography of the matrix language (eg. "Hóm Rúl", "phosphorus") is treated as foreign, even when (in Gaelic matrix text) it is subject to initial mutation or terminal inflection
following the Gaelic pattern (eg. "gnoithe an phledge", "an t-Abbé", "a Mhiss"). Since foreign matter will be exempted from the later auto-demutation process, it may be advisable
to perform manual demutation (gnoithe an \p^hledge\, an \^t^-Abbé\, a \M^hiss\) preventively at this stage on such short foreign items.
Foreign matter is also exempted from auto-segmentation. It is far from clear whether there are cases in which segmentation within foreign matter may be desirable; consider, for example,
"aide-de-camp" or "a man's a man" embedded in a Gaelic text. If segmentation is considered desirable in such a case, it may be performed manually preventively at this stage,
in the manner appropriate for the foreign language, eg. \aide+-+de+-+camp\, \a man+'s a man\.
Foreign names
A particularly important class of short foreign matter is that of foreign name-words (names or components of names). These will of course be marked as names, but we also mark them as foreign –
as one reason among many, we note that French names may contain articles and prepositions as elements, which are homographic with Gaelic language words (eg. le, de).
Foreign names are especially numerous in translations, where they may greatly outnumber other pieces of foreign matter.
By marking foreign name-words, we allow ourselves the option to exclude them from selected processing, such as making lemmatised indexes for the matrix language, where they may
constitute unwanted clutter, though we will still include them in forms indexes.
Notice that the marking of foreign name-words as foreign is a new policy, which has so far been applied to the Ulaidh collection of Gaelic texts, but not as yet to other collections of
Gaelic texts, or to non-Gaelic texts.
The criteria on which we have settled for marking name-words as foreign are more inclusive than originally envisaged. A form denoting a proper name (eg. Albain, Críost)
or a noun or adjective derived from a proper name (eg. Albanach, Críostaí), or a noun or adjective denoting a group member even if not derived from a proper name (eg. Caitliceach, Gaedheal),
is marked foreign unless it satisfies both of two admission guidelines:
1. the form is fully assimilated to Gaelic orthography; and
2. the name-word is one which has a reasonable frequency, or is long established in the Gaelic language.
The first condition has been relaxed only in a very few cases, eg. Júdach, Israel, Moreibh, Híleans, Jaighneoir, Calhéim, Hacoinn.
Names like Dan, Jimmy, Joe, John, Mick, Hughy, Andy, Ned, Jack, Annie, etc, however spelled, are marked foreign
when they occur in foreign matter; for example, John is marked foreign in “John Mitchel” but not in “John Dhonnchaidh.”
The frequency threshold for the second condition is not clearly defined, and in practice the question asked is a subjective one,
viz. is the name-word frequent enough that we would wish to have it in a Gaelic dictionary, or would it simply be clutter?
So, a name-word of foreign origin, even though it is fully orthographically assimilated, undergoes initial mutation, and undergoes
terminal inflection according to Gaelic paradigms, may still be marked foreign if it is very infrequent.
The frequency consideration in the second condition is intended to reflect the intuition that "Lonndain" (for example) ought to be included in a lemmatised index, whereas
"Coirint" (again, for example) should not. But the dividing line is vague, and inconsistencies may abound.
We must expect to find a common foreign name-word Gaelicised in different ways, some of which may be marked foreign while others may not.
Sometimes a considerable number of different Gaelicisations may meet the non-foreignness criteria above, eg. none of the following are marked foreign:
Londain, Londainn, Londuin, Londúin, Londún, Lonndain, Lonndainn, Lonndan, Lonnduin, Lonndúin, Lonndún, and even London.
Likewise for India, Na hIndiacha (India); Indianach, Indiathach (an Indian); Indiach, Indianach, Indiathach (Indian, adj).
Fuller lists are available of forms which are marked foreign and forms which are not so marked, concentrating on forms considered borderline.
If a name-word which is not to be marked foreign occurs in foreign matter, it remains foreign.
Nouns or adjectives derived from foreign name-words are also marked foreign, eg. Maraîchine, Poitevin, Vendéan, an t-Idumean, Coirinteach, Edomach, Essénach, Macedoniánach,
Scythiánach, Gailliach. So also, under the same criteria, other foreign nouns or adjectives concerned with group membership, even though not derived from a personal or place name-word,
eg. Basque, Boer, Brahman, Gurcach, Hindu, Iroquois, Mahomatach, Ottamáin, Sadduichíneach, Sioux, etc.
As with other foreign matter, foreign name-words are exempted from auto-demutation and auto-segmentation. This is appropriate for such examples as Hindeberg, Thornton,
Shakespeare; Ben-Hur, Gore-Booth, Sol-leks. On the other hand, where demutation or segmentation is appropriate, it must be manually implemented, either preventively or
retrospectively, as in these examples: athair B^hen-Hur, tar annseo a B^hrunton, i ^gCollege Green.
Preventive markup
Later phases in the preparation of text will involve some automated processing, which will be fully described later. This processing will encounter particular situations
which may lead to error unless markup is inserted at this stage to guide it and avoid introducing errors. In this section we give an outline only of the automated processes,
and mention some situations where preventive action at this stage may be useful.
Note that the automated segmentation/demutation process is to be revised, as part of the development of a Gaelic lemmatisation process.
Segmentation
An automated process is later used to segment words into tokens, by inserting a plus sign + (U+002B) at the division point,
eg. má+'s, 's+eadh, sean+-+teach. Division points are often detectable by the presence of a hyphen or apostrophe, but there are also words which should be
divided even though neither character is present (eg. for Gaelic, sa > s+a, the segments being allomorphs of the preposition i and the article an respectively;
or: atá, imeasc, sna, arsan, don, den).
The segmentation process also handles (interactively) an end-of-line hyphen, which may indicate a broken word in which the hyphen is impermanent. If impermanent, the hyphen is changed to ^-.
(See demutation for the meaning of the hat character as mark-up.)
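Once the plus signs are in place, reading the tokens back is straightforward. The following Python sketch (with hypothetical function names) shows only this reading step; it does not attempt the segmentation decision itself.

```python
def split_segments(marked_word):
    """Split a segmented word into its tokens at each plus sign."""
    return [part for part in marked_word.split("+") if part]

def unsegmented(marked_word):
    """Recover the word as printed by removing the plus signs."""
    return marked_word.replace("+", "")

print(split_segments("sean+-+teach"), unsegmented("sean+-+teach"))
```

Note that in a form such as sean+-+teach the hyphen becomes a token in its own right.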
At this stage in processing, if we notice a word which would be wrongly segmented by the automated process, preventive manual segmentation may be applied. Alternatively, the error may
be manually rectified afterwards.
For example, in Gaelic, we might need to preventively segment "agamsa" (without hyphen), as it is not practical automatically to distinguish those words ending in "sa" or "se" which should be
segmented from those which should not. As another example, the word "'na" will be auto-segmented as 'n+a, appropriate for the meanings "in his/her/its/their",
"to the" (masc) and "to the" (fem). But when meaning "than", as more usually written "'ná", segmentation should be prevented.
It is advised to segment manually an unhyphenated compound if the hyphenated version of the compound occurs elsewhere. For example, "galltrompa(í)", "galltrump(a)" and "gallchnó" have been manually
hyphenated for this reason, while "gallbhuaile" and "gallóglach" have not. A problem with this advice is that the hyphenated version may turn up later in a newly-added text.
Also, in the very rare case of an end-of-line hyphen which is not a transient artifact of line-breaking, yet which should not cause segmentation (because separating the hyphenated parts
would produce lexically-meaningless components), then neither +-+ nor ^- is appropriate, but rather the hyphen should be left without additional markup.
This could arise in some hyphenated names (Fean-dubha-dadaidh-am), or in a case like gabht-/se or da-/réag (using / to represent a line-break), if the practice of the text is to use a hyphen in these words
even away from a line-end. It is more common in foreign material, but there the foreign marking blocks the auto-segmentation process from inserting +, and avoids any problem.
A special look-out may be kept in Gaelic texts for words which are totally elided (rather than simply reduced to an apostrophe). To re-insert the elided word would be phonetically unacceptable,
and instead we introduce «» to represent it, and place + between it and an adjacent (usually, following) word. Some examples:
• ní thiocfadh áit «»+fhághail ("a" elided)
• stadaigidh «»+dh'imirt bomaite ("a" elided)
• ní rabh «»+fhios agam ("a" elided)
• ag éisteacht le n-«»+athair ("a" elided)
• ní rabh ann ach cé «»+b'fhearr ("a" elided)
• mar «»+dubhairt an ghiorsach ("a" elided)
• níos gile na «»+bhí sé ("a" elided)
• comh luath agus «»+tháinig sí ("a" elided)
• goidé «»+tá tú «»+brath a dheánamh? ("a" and "ag" elided)
• chúig nó sé «»+bhomaití ("de" elided)
• tá mé marbh «»+mo luighe ("in" elided)
• «»+rabh tú riamh annsin? ("an" elided)
When we come to lemmatisation, the guillemets will be filled by a number to indicate the identity of the elided word.
Making the elision explicit like this may help beginning learners to understand the syntax (without misleading them as to the sound), and it may be critical
in any eventual syntactic analysis of text.
Auto-segmentation is not applied to foreign matter, eg. "aide-de-camp" or "a man's a man" embedded in Gaelic matrix text. Consequently, if it seems desirable to segment in such a case,
the segmentation must be performed manually, either preventively at this stage, or else retrospectively after auto-segmentation, in accordance with the properties of the foreign language.
Demutation (Gaelic only)
The same automated process which handles segmentation of words is also used to mark initial mutation of Gaelic words.
It introduces a hat character ^ (U+005E) before any character which is a result of initial mutation.
For example: f^hear, ^b^hfear, ^t^-athair, ^n^-athair, ^hathair
Non-initial lenition is not marked-up.
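Given this marking, both the form as printed and the demutated form can be read off mechanically, as in the following Python sketch. The function names are hypothetical, and the sketch covers only the reading of existing hat marks, not the decision where to place them.

```python
import re

_HATTED = re.compile(r"\^.")  # a hat together with the character it marks

def surface_form(marked):
    """The word as printed: remove the hats, keep every letter."""
    return marked.replace("^", "")

def demutated_form(marked):
    """The word without mutation: remove each hat and the character after it."""
    return _HATTED.sub("", marked)

for word in ["f^hear", "^b^hfear", "^t^-athair", "^n^-athair", "^hathair"]:
    print(surface_form(word), demutated_form(word))
```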
The automated process identifies words beginning with certain sequences of characters, for example, bh, ch, ... mb, gc, ... h-, n-, t-, ..., and marks those characters which are artifacts of mutation
by prefixing a hat character ^ to them. The algorithm is somewhat more complex than this description suggests, and accesses lists of exceptions, such as words which
resist demutation, eg. chuig, (go) dtí, hata, halla, or which are ambiguous in this respect, eg. thart.
The list of words which are not automatically demutated for inclusion in the forms index has now been extended to include:
dhá (when it means "two") and various forms of "ag a"/"ag ár", such as gha, ghá, dha, dhá, dh', gh', 'gh, 'dh, 'ghá, 'ghár, 'dhá, 'dhár, etc.
At this stage in processing, if we notice a word which ought to resist demutation but which is not on the stored lists, and which we do not wish to add
to the lists, preventive manual demutation may be required. Alternatively, the error may be manually rectified afterwards.
Auto-demutation is not applied to foreign matter, thus avoiding false positives for demutation in foreign words, eg.
• ‡\Shakespeare\‡
• gan \phosphorous\
• \Hóm Rúl\
• iomlán na \Hanóibhéarach\
But, because in Gaelic matrix text we regularly find foreign words — both name-words and others — which are mutated, preventive (or else retrospective) demutation must be manually
applied to these, for example:
• tar annseo, a ‡\B^hrunton\‡
• in éadan na \^n^-Allies\
• an \^t^-Abbé\
• i ‡\^gCollege Green\‡
This is a major cause of the need for manual intervention.
End of non-final line of paragraph
At a later stage, an automated process will calculate the number of spaces to be counted at the end of a line, and will place a vertical bar character | (U+007C) after the requisite number.
For now, we may enter the vertical bar manually on any line where we notice that the automated process may be misled as to the number of spaces to leave.
For non-final lines of a paragraph, the number of spaces will normally be one, except where the text on the line ends in a hyphen, when no space is left before the vertical bar
(after possibly replacing the hyphen interactively by either ^- or +-+, as described under Segmentation).
For a paragraph-final line, or any other line followed by blank lines, the number of retained spaces is calculated as one more than the number of following blank lines.
Manual prevention may be required with a line which is not paragraph-final and which ends with the end of a sentence. Two spaces should be left in such a case.
The automated testing for a sentence-end involves examining the punctuation, and may get it wrong:
• it may wrongly leave two spaces where there is a full stop which is actually only an abbreviation (although common abbreviations should already be catered for)
• it may wrongly leave two spaces where there is a question or exclamation mark, followed by quotation marks, but the sentence continues on the next line
(the testing will correctly leave only one space if the next line begins with "said", "arsa", etc., but there may be other such cases not catered for).
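The counting rules above can be summarised in a short Python sketch. This is an illustration under assumptions, not the automated process itself: the function names are hypothetical, and the sentence-end decision (the part which the real testing may get wrong) is simply passed in as a parameter rather than derived from the punctuation.

```python
def spaces_before_bar(line, blank_lines_after, ends_sentence=False):
    """Number of spaces to leave between the text and the vertical bar."""
    text = line.rstrip()
    if blank_lines_after > 0:
        # Paragraph-final line, or any line followed by blank lines.
        return blank_lines_after + 1
    if text.endswith(("-", "+-+")):
        # Line ends in a hyphen (possibly already marked as ^- or +-+).
        return 0
    # Non-final line: two spaces at a sentence end, otherwise one.
    return 2 if ends_sentence else 1

def mark_line_end(line, blank_lines_after, ends_sentence=False):
    """Return the line with its trailing spaces normalised and the bar added."""
    count = spaces_before_bar(line, blank_lines_after, ends_sentence)
    return line.rstrip() + " " * count + "|"

print(mark_line_end("agus mar dubhairt", 0))
```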
Parallel texts
Anything which is not present in the Gaelic version of a text may be omitted in other versions also, even when they are the original.
Examples range from a verse or quotation at the start of a chapter, to half the entire contents of a work (as with "Sgéalta Sealgaire.")
Exceptions to this practice, however, have been frequently made.
Footnotes added in translation are generally dropped, as being confined to one version of a parallel text.
Unpaginated texts
Texts obtained in digitised form from the internet — as may be the case for some non-Gaelic texts — do not always preserve the original book division into pages and lines.
In some cases, we may re-impose the pagination by hand, but in others this work may not be attempted.
Clearly, when a segment of an unpaginated text is retrieved later, it will not be possible to display the location (page and line) of the segment.
An unpaginated text should contain a pseudo-reference line <L 0> at the beginning, and no further <L reference lines.
End-of-line marking should be applied as normal to the lines of the text as it was received.
Examples of text after keyboarding are given below.
II.B Scanning and OCR
A good part of the work of keyboarding a text may be avoided by the use of scanning and optical character recognition (OCR).
It is worth remembering however that the results of OCR almost always require careful and labour-intensive editing.
For scanning text and OCR, I use ABBYY FineReader 5.0 Pro (2001) under Windows. I set the "formatting settings" for saving text to include "TXT"
and "keep line breaks" and "use blank line as paragraph separator" and "save type UTF-8".
The procedure depends in part on the language and script. We select the language for OCR as the matrix language of the text,
but we must then expect that any embedded material in other languages is liable to have its special characters mis-identified.
Non-Gaelic text in Roman script.
The appropriate "Language" setting must be selected. Otherwise the procedure is straightforward.
Gaelic text in Roman script.
The "Language" setting is "Irish", which is among the "Additional languages" in FineReader 5.0.
The language setting tells the program which set of characters needs to be distinguished (the "alphabet"); this set can be edited.
For Irish the set is as for English plus the five acute-accented vowels, in both upper and lower case. No other special measures are required.
Gaelic text in Gaelic script.
For Gaelic script, an OCR program is required which can train its recognition font on the Gaelic script.
Historically the program most used for Gaelic script has been Optopus, produced by Makrolog GmbH,
for which development was discontinued in 1993 with version 1.4 beta, and which is no longer available.
Optopus can still be run under recent versions of MS Windows, where however it is subject to frequent breakdowns.
The only other readily-available scanning program which supports font training is FineReader, and this is what we now use.
To train FineReader to use a Gaelic font, numerous examples of each character must be tested and their identification manually corrected,
until some stability is achieved.
Part of the training involves the conversion of lenited consonants to digraphs (consonant + h), referred to by the program as "ligatures".
Even so, the inconspicuous nature of the dot of lenition makes it difficult to distinguish from a speck of dust, and its identification remains unreliable.
Retraining is generally required for each new text, however similar the book font may appear to those already trained.
Sections of text in a different font, including sections at a different typesize, are generally not recognized and must be input manually.
Despite these difficulties, scanning with FineReader is worthwhile.
German text in Fraktur script.
For this special purpose, the ABBYY FineReader Online Fraktur OCR service has been availed of,
as also has Tesseract (the frk language model) via gImageReader for Windows.
Unlike FineReader 5.0 when used offline, the online FineReader service does not provide the option of retaining the page and line structure of the original,
and we have to reconstruct this manually. gImageReader allows the structure to be preserved. A marginal benefit of gImageReader is that it can recognize Latin-script text embedded
in a Fraktur text.
Two examples of text after keyboarding or scanning and initial editing
This is how digitised text will look at this stage, before launching manual or semi-automated enrichment.
Example 1:
<L 7>
MAC AN IASCAIRE
Thiar sa tsean-tsaoghal bhí mórán de na daoine 'na
gcomhnaidhe cois cladaigh agus iad ag baint a mbeatha as
cnuasach na trágha agus as iasc na fairrge. San am a
bhfuil mé ag trácht air bhí buachaill san áit seo a chaith an
mhórchuid de n-a shaoghal ag iascaireacht ar an fhairrge. Mar
go rabh sé 'na bhádóir aireach ghlic bhí togha agus rogha de
ógmhná na tíre ag brath air; agus bhí a shliocht air, phós
sé an cailín a ba dóigheamhla sa pharáiste.
Bhí téagar agus carthannas eadar an lánamhain agus ní
tháinig aon fhocal searbh ariamh eatorra. Bhí rath agus bláth
ortha agus gach uile nidh acu ar a dtoil féin, ach amháin go
rabh siad ar easbhaidh cloinne. Bhí siad corradh le fiche
bliadhain pósta agus ní bhfuair siad aon duine de theaghlach,
agus annsin bhain siad dúil de shliocht.
Níor chothuigh sin mí-shásamh ar bith eatorra. Bhí siad trioll-
mhasach críonna, agus leag siad thart a bheagán nó a mhórán
i gcomhnaidhe fá choinne na “coise tinne.” Nó bhí 'fhios acu
go dtiocfadh an lá a n-imtheochadh an óige, agus mar dubhairt
an sean-rann:
“…mur' mbí sé agat féin,
Béidh tú fannlag is claon amuigh leis an ghréin.”
D'oibir an t-iascaire 'fhad is bhí sé in innimh buille rámha
a tharraint agus eangach a chur. Is iomdha oidhche anróiteach
a chaith sé ar an fhairrge le cascairt agus le fuacht, le dócal
iomartha agus le heasbhaidh bidh. Ach nuair a bhí sé ró-aosta
le lá oibre a dheánamh ghlac sé a scríste agus leig a mhaidí
le sruth.
Thoisigh siad annsin a bhaint as an lón a bhí deánta acu.
Ach nuair a bíthear i gcomhnaidhe ag baint as agus gan ag
cur ina cheann ní mhaireann sé i bhfad. Nuair a bhí an
t-airgead ar shéala bheith reaithte tháinig imnidhe ar an iascaire
go mbéadh droch-dhóigh ortha i ndeireadh a saoghail. Ní rabh
de sheift aige ach a ghabháil i gceann na hiascaireachta arís.
Bhí a chuid eangach stiallta stróctha agus b'éigean dó a
gcóiriú. Nuair a bhí deis aige ortha chuir sé isteach sa bhád
iad agus d'iomair amach go béal an chuain. Chuir sé na
heangacha le luighe gréine agus thóg sé arís iad le bánú an
lae, ach ní rabh dadadh ionnta ach meilearach agus slata mara. |
Shuidh sé ar na rámhaí agus tharraing isteach ar an bhaile.
Leis sin tchí sé chuige an long faoi sheol agus í ag clasú
na fairrge ar mhéad is bhí de shiubhal léithe. Scannruigh an
<L 8>
Example 2:
<L 1>
‡CLOICH CHEANN-FHAOLAIDH‡.
In ainm an Ríogh cuirim tús ar an
leabhairín seo “chum glóire Dé agus
onóra na ‡hÉireann‡.”
B'fhéidir nár bh'fhearr nídh dá ndean-
fainn 'ná rud éighinteacht a chur síos ar
thamall shubhailceach a chaith sgaifte
againn ar Árd-Sgoil ‡Cholumcille‡ anur-
aidh, i ‡gCloich Cheann-Fhaolaidh‡!
I ‡gCloich Cheann-Fhaolaidh‡! Cá bhfuil
‡Cloich Cheann-Fhaolaidh‡? adeir an léightheoir. Maise,
a léightheoir mhaith, shaoileas nach mbéadh féidhm orm sin
a innsin duit. Badh í an bharamhail a bhí agam nach rabh
Gaedheal ó chionn go cionn na ‡hÉireann‡ nach gcualaidh
trácht ar ‡Chloich Cheann-Fhaolaidh‡.
Dá gcuirtí ceist orm cá bhfuil an áit is Gaedheal-
aighe i gCúige ‡Uladh‡, bhéarfainn mar fhreagra nach
bhfuil áit ann comh Gaedhealach leis an cheanntar
talaimh atá os coinne Oileán ‡Thoraighe‡ isteach. 'Sí an
teanga a labhair ‡Columcille‡ imeasg sléibhte ‡Thír
Chonaill‡ na mílte bliadhan ó shoin a chluintear ag aosta
agus ag óg go fóill ann. Ar altóir Dé cuireann an
sagart i gcuimhne do'n phobal ann, i dteangaidh
‡Phádraic‡, go dtiocfaidh an lá nuair a chaithfeas cách
cunntas cruinn a thabhart do'n Chruthuightheoir mar chaith
sé 'ach a'n bhomaite dá shaoghal. Má's ag siubhal an
bhealaigh mhóir ann atá tú, castar airde de ghlúin de
pháiste ort nach dtuigeann acht an teanga a bhí fá
<L 2>
A problem with Russian texts
Russian texts generally do not distinguish between the letters e and ë, the accent on the latter often being omitted.
For lexical purposes, however, the distinction is relevant, and we try to correct e to ë in our texts.
Program Udarjenie is intended to make this process less tedious.
The program prompts for the name of the input file, and produces an output file containing candidate forms for changing from e to ë.
These forms may then be located in the text and, if appropriate, changed manually.
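The candidate search can be illustrated with a short Python sketch. It is not Program Udarjenie itself: the function name yo_candidates and the tiny word list KNOWN_YO_FORMS are hypothetical, a real list of forms would have to come from a lexicon, and the code assumes the text uses the Cyrillic letters е and ё.

```python
import re

# Hypothetical sample of forms spelt with ё; a real run would load a lexicon.
KNOWN_YO_FORMS = {"ещё", "её", "всё", "идёт"}

def yo_candidates(text, known_forms=KNOWN_YO_FORMS):
    """Return (line_number, word, suggestion) for words that may need е -> ё."""
    # Map the accentless spelling of each known form to the spelling with ё.
    plain = {form.replace("ё", "е"): form for form in known_forms}
    found = []
    for number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[а-яё]+", line.lower()):
            if word in plain and word != plain[word]:
                found.append((number, word, plain[word]))
    return found
```

The output is only a list of candidates: a form such as все may be correct as it stands, so each suggestion still needs the manual check described above.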
End of digitisation phase
This concludes the basic digitisation of text (phase II.A/B).
We now continue to describe several semi-automated processes (phase II.C/D/E/F) which can enhance the value of the text.
Before proceeding to the next phase, it is suggested to run prepared text through Program Ainmneacha and Program Foreign, choosing to output the results in text order.
This will confirm whether the name delimiters and foreign matter delimiters, respectively, are properly paired.
If they are not, the semi-automated enhancement of the text may not function as desired.
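The pairing check can also be sketched independently of those two programs. The Python function below is an illustration only, with a hypothetical name (delimited_spans). It assumes that the name delimiter is ‡ and the foreign matter delimiter is \, as in the examples in this document, and that a delimited stretch may run over a line break.

```python
def delimited_spans(text, delimiter):
    """List (start_line, content) for each delimited stretch, in text order.

    Also returns the line on which an unpaired delimiter was opened,
    or None if every delimiter is paired.
    """
    spans = []
    open_line = None
    current = []
    for number, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch == delimiter:
                if open_line is None:
                    open_line = number      # opening delimiter
                    current = []
                else:
                    spans.append((open_line, "".join(current)))
                    open_line = None        # closing delimiter
            elif open_line is not None:
                current.append(ch)
        if open_line is not None:
            current.append(" ")             # the stretch continues on the next line
    return spans, open_line
```

Reading the listed stretches in text order makes a missing delimiter easy to spot, because the stretch that follows it becomes implausibly long.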
Semi-automated enhancement of digitised text.
Under this heading are included:
• generation of wordlists, for checking accuracy of text
• adding end-of-line markers to text
• segmentation of tokens
• marking of initial mutations
II.C Word lists — Program Anailís
A suitable process to run on a text brought to this stage of preparation is to make an alphabetic list of word forms, which provides a powerful check on the accuracy of the text.
Human inspection of this list quickly draws attention to wordforms which are impossible.
Of course, since the wordforms are seen without context, it will not detect cases where an error has turned the real form into one which, though different, is also valid.
Program Anailís supports such checking. It can make a wordlist, an index locorum, or a concordance.
It is suggested that an alphabetically-ordered wordlist be generated from the text at this stage,
choosing program options to retain case and retain markup.
An index locorum should be generated at the same time, with the same parameters.
The wordlist may be inspected, looking for impossible or dubious spellings,
and the position in the text of any suspected errors can be located by reference to the index locorum.
The above images show the suggested parameters for a wordlist (left) and index locorum (right), made at this stage after initial text digitisation.
The name of the file containing the text should be entered in the topmost box (or use "Cuirtear lorg" to browse), and the "Deán liosta" button clicked.
The resulting wordlist may be examined for impossible or unlikely wordforms, and their page and line numbers obtained from the index locorum.
Here is a segment of a wordlist, showing a dubious wordform.
Annsin 2
anois 2
ant-uisge 1
anuas 6
aois 1
The dubious form "ant-uisge" suggests that a space between two tokens has been omitted. The corresponding section of the index locorum
anois 1933 LU043 11 17
anois 1933 LU043 11 30
ant-uisge 1933 LU043 14 22
anuas 1933 LU043 3 16
anuas 1933 LU043 5 32
shows that the potential problem lies on page 14, line 22 of this text.
As another example from a wordlist
chuir 3
Chuir 5
Chuireann 1
chuirfear 1
chuirfimís 1
The form "Chuireann" is dubious, as it is unlikely to be capitalized and lenited at the same time.
Use of Program Anailís, with different parameters, will be suggested again later.
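For readers without access to Program Anailís, the idea of a wordlist with a matching index locorum can be sketched in Python. This is a rough illustration, not the program's algorithm. It assumes that a reference line of the form <L n> starts page n, that lines are numbered from 1 within each page, and that blank lines are not counted. The function name wordlist_and_index is hypothetical.

```python
import re
from collections import Counter

PAGE_REF = re.compile(r"^<L (\d+)>$")              # assumed form of a page reference line
WORD = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")  # letters, with internal hyphen or apostrophe

def wordlist_and_index(text):
    """Return (wordlist, index): form counts and (form, page, line) entries."""
    counts = Counter()
    index = []
    page, line_no = 0, 0
    for line in text.splitlines():
        ref = PAGE_REF.match(line.strip())
        if ref:
            page, line_no = int(ref.group(1)), 0
            continue
        if not line.strip():
            continue
        line_no += 1
        for form in WORD.findall(line):
            counts[form] += 1
            index.append((form, page, line_no))
    # Alphabetical order ignoring case; on ties lower-case sorts before capitalised.
    wordlist = sorted(counts.items(), key=lambda item: (item[0].lower(), item[0].swapcase()))
    index.sort(key=lambda e: (e[0].lower(), e[0].swapcase(), e[1], e[2]))
    return wordlist, index
```

A form such as "ant-uisge" then appears once in the wordlist, and its index entry gives the page and line to inspect.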
II.D End of line marking — Program Deisiú
An automated process is now run to enhance the text by adding a vertical bar character | (U+007C) to every text line which does not already have such a character.
No vertical bar will be added to a blank line, or to a reference line.
Any line into which a vertical bar character has already been inserted manually is left unchanged.
The number of spaces to be left before the vertical bar is calculated as follows:
• if the line is followed by one or more blank lines, the number of retained spaces is one more than the number of following blank lines.
• otherwise, the number of retained spaces will normally be one, except where the text on the line ends in a hyphen, when no spaces are retained.
Other cases are best dealt with preventively, but some attempt is made to handle them,
eg. two spaces are left after common sentence-final punctuation, but only one if:
• the line ends in the abbreviation .i.
• the following line begins with arsa, ar, or ars'
A sequence of at least 10 spaces is added to mark the end of the whole text.
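The two basic spacing rules can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the code of Program Deisiú: it applies only the two bulleted rules literally, leaves out the punctuation-based adjustments and the end-of-text marker, and assumes that a reference line is one enclosed in angle brackets. The spacing it produces may therefore differ from the program's. The function name mark_line_ends is hypothetical.

```python
import re

REFERENCE = re.compile(r"^<.*>$")   # assumed shape of a reference line, e.g. <L 7>

def mark_line_ends(lines):
    """Add an end-of-line bar to each text line that does not yet have one."""
    out = []
    for i, raw in enumerate(lines):
        text = raw.rstrip()
        if not text or REFERENCE.match(text) or "|" in raw:
            out.append(raw)          # blank, reference and pre-marked lines are unchanged
            continue
        # Count the blank lines that immediately follow this line.
        blanks = 0
        j = i + 1
        while j < len(lines) and not lines[j].strip():
            blanks += 1
            j += 1
        if blanks:
            spaces = blanks + 1      # one more than the number of following blank lines
        elif text.endswith("-"):
            spaces = 0               # no space after an end-of-line hyphen
        else:
            spaces = 1
        out.append(text + " " * spaces + "|")
    return out
```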
The process is performed by Program Deisiú, menu option "Marcáil dheireadh líne".
The text file is first opened, using the Téacs main menu option, Foscail submenu option.
The process is triggered by choosing the Marcáil dheireadh líne main menu option — no further parameters are required.
When the process reaches the end of the file, choose Téacs, Sabháil Mar, then Amach.
This process appears to create some redundancy: an end of paragraph, for example, is now marked twice over: first, by three spaces before the "|" on its last line,
and secondly, by one blank line following it. When we come to considering parallel texts in different languages, we will see that
the in-line spaces are adjusted to secure alignment across the language versions, while the blank lines remain unaltered, reflecting the layout
of each particular language version.
Note that, as described under Sentences and Paragraphs, some texts may contain "type codes" after the vertical bar,
to classify any text units which end on that line. Current applications make no use of these type codes but ignore them.
II.E Token segmentation and demutation — Program Deisiú
Segmentation is the insertion of appropriate mark-up at hyphens and apostrophes.
Demutation, which is relevant only for Gaelic text, is the insertion of mark-up for characters which are the result of initial mutation.
Segmentation and demutation may be assisted by the use of Program Deisiú.
(Note that, for German text, in particular, where end-of-line hyphens are frequent and hyphens at other points are rare,
it may be easier to perform segmentation by global edits (with validation) rather than using Program Deisiú.
The few hyphens not at end-of-line (eg. Shakespeare-Abend) may be edited from - to +-+
after the end-of-line hyphens have been edited from -| to ^-| —
except for the infrequent case of a Shakespeare-Abend type of hyphen at end-of-line, when the -| will be made +-+|
The few apostrophes, e.g. in ist's may also be edited, in this case to ist+'s )
An automated process is used to segment words into tokens, by inserting a plus sign + (U+002B) at the division point,
eg. má+'s, 's+eadh, sean+-+teach. Division points are often detectable by the presence of a hyphen or apostrophe, but there are also words which should be
divided even though neither character is present (eg. sa > s+a, the segments being allomorphs of the preposition i and the article an respectively).
Regarding the hyphen: while many hyphenated words contain an identifiable prefix or suffix (eg. sean-teach, agam-sa), others do not (eg. N-N or ADJ-N compounds, such as ceann-tuigheadh,
meadhon-lae, maol-chnoc, mall-triallach). In view of this, we choose to make the hyphen a separate token (eg. sean+-+teach rather than sean-+teach).
Moreover we are likely to encounter "seanteach", which must be segmented into sean+teach, which analysis is more compatible with sean+-+teach than with sean-+teach.
The same automated process is used to mark initial mutation of Gaelic words.
It introduces a hat character ^ (U+005E) before any character which is the result of initial mutation.
This demutation operation is not applicable to languages other than Gaelic.
For Gaelic text, the operations of segmentation and demutation are applied iteratively,
that is, segmentation should be attempted first; and then the first segment (which will often be the only segment) should be considered for demutation.
If there is a second segment, it is now treated as if it were a complete word, and the whole process is repeated until no further segmentation is possible.
For example, the result of such an iterative application on "mo shean-chró" would be mo s^hean+-+c^hró.
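The iteration can be sketched in Python as below. This is a deliberately simplified illustration, not the algorithm of Program Deisiú: the mutation rules (lenition, eclipsis, prefixed h, n-, t-) are rough assumptions, apostrophes and end-of-line hyphens are ignored, and there are no exception lists, so forms such as chuig or halla would be wrongly marked. All function names are hypothetical.

```python
import re

VOWELS = "aeiouáéíóúAEIOUÁÉÍÓÚ"
ECLIPSED = ("mb", "gc", "nd", "ng", "bp", "dt", "ts")   # the first letter is the mutation
LENITABLE = "bcdfgmpst"

def demutate(word):
    """Put a hat before each initial character assumed to result from mutation."""
    low = word.lower()
    if low.startswith("bhf"):                          # bhfear -> ^b^hfear
        return "^" + word[0] + "^" + word[1] + word[2:]
    if re.match("[nth]-[" + VOWELS + "]", word):       # t-athair -> ^t^-athair
        return "^" + word[0] + "^-" + word[2:]
    if low[:2] in ECLIPSED:                            # gcomhnaidhe -> ^gcomhnaidhe
        return "^" + word
    if re.match("h[" + VOWELS + "]", word):            # hathair -> ^hathair
        return "^" + word
    if len(word) > 1 and low[0] in LENITABLE and low[1] == "h":   # fhear -> f^hear
        return word[0] + "^" + word[1:]
    return word

def segment_and_demutate(word):
    """Split at hyphens, demutating the first segment before moving to the rest."""
    parts = []
    rest = word
    while rest:
        # A hyphen straight after initial n, t or h belongs to the mutation.
        start = 2 if re.match("[nth]-", rest) else 0
        cut = rest.find("-", start)
        if cut == -1:
            parts.append(demutate(rest))
            break
        parts.append(demutate(rest[:cut]))
        parts.append("-")                              # the hyphen is a token of its own
        rest = rest[cut + 1:]                          # treat the remainder as a complete word
    return "+".join(parts)

def enrich(text):
    return " ".join(segment_and_demutate(w) for w in text.split())
```

Applied to "mo shean-chró", the sketch first cuts at the hyphen, marks the lenition in the first segment, and then repeats the process on the second segment.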
Note that the automated segmentation/demutation process is to be revised, as part of the development of a Gaelic lemmatisation process.
The automated segmentation process by default breaks a hyphenated word into three tokens (except where, in Gaelic, the hyphen is part of an initial mutation).
An end-of-line hyphen may indicate a broken word, in which the hyphen is impermanent, and it should then be changed to ^-. (See demutation
for the meaning of the hat character as mark-up.) During the automated process, each example is presented for interactive decision between +-+ and ^-.
By default, an apostrophe at the start of a word is separated; an apostrophe elsewhere in a word does not cause a separation.
The default behaviour is supplemented by lists of exceptions, stored in plain-text files. Lists may include words which should be segmented even though they have neither
hyphen nor apostrophe; eg (for Gaelic) atá, imeasc, sna, arsan, don, den; or words which contain a hyphen but are not to be segmented, eg. (for Gaelic) dá-réag, gabht-se,
(for English) to-day, to-night, to-morrow. Among the exception lists are some where the word is ambiguous, so that different tokens may require
different treatment; each token will invite an interactive decision.
Auto-segmentation is not applied to foreign matter, so that hyphenated foreign words are retained as unities,
eg. aide-de-camp, Bourg-sous-la-Roche. (Of course if the elements are spaced, rather than hyphenated, they will be separated.) Segmentation of foreign matter is normally undesirable,
but if on occasion we wish to segment, we should enter the segmentation preventively, eg. a man+'s a man, as appropriate for the foreign language rather than the matrix language.
Even in foreign matter, an end-of-line hyphen is still presented for an interactive decision on whether the hyphen is permanent or not; if permanent, no markup is added.
This automated segmentation process also handles end-of-line hyphens. When the text of a line ends in a hyphen, just before the vertical bar,
the user is prompted for an interactive decision. If the hyphen should not be retained when the broken word is reconstituted, the hyphen is replaced by ^-,
while if the hyphen should be retained and should cause a division between tokens, the hyphen is replaced by +-+. (See immediately below for the use of the ^ character.)
(Unfortunately, a transient end-of-line hyphen will be visible when text is displayed, even when original line structure is not maintained. To remedy this, it would be necessary to mark it
by a new markup character other than ^.)
The same automated process which handles segmentation of words is also used for demutation, that is, to mark initial mutation of Gaelic words.
It introduces a hat character ^ (U+005E) before any character which is a result of initial mutation.
For example: f^hear, ^b^hfear, ^t^-athair, ^n^-athair, ^hathair
When the text is later processed, this hat character can be appropriately interpreted.
When tokenising the text, the hat and the character immediately following it are dropped.
When displaying text, the hat is dropped but the following character is retained.
In concordance terminology such non-lexical characters within words have been described as "padding", and the hat character may be said to function as a "padding escape".
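The two interpretations of the hat can be stated in a couple of lines of Python. The function names token_form and display_form are hypothetical, and the sketch deals only with the hat character, not with other markup.

```python
import re

def token_form(marked):
    """Form used when tokenising: drop each hat together with the character after it."""
    return re.sub(r"\^.", "", marked)

def display_form(marked):
    """Form used when displaying: drop the hats but keep the following characters."""
    return marked.replace("^", "")
```

For the marked-up word ^b^hfear, tokenising yields fear and display yields bhfear.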
Non-initial lenition is not marked-up.
The demutation process identifies words beginning with certain sequences of characters, for example, bh, ch, ... mb, gc, ... h-, n-, t-, ..., and marks those characters which are artifacts of mutation
by prefixing a hat character ^ to them. (The algorithm is somewhat more complex than this description suggests.)
As with segmentation, the behaviour is supplemented by lists of exceptions, stored in plain-text files: ENRICH.DAT for Gaelic (Irish) and SGWORD.DAT for Gaelic (Scottish).
Lists include words which are resistant to demutation, either invariably, such as chuig, (go) dtí, halla, hata; or ambiguously, such as thart – tokens of ambiguous
words will cause the process to seek an interactive decision.
The list of words which are not automatically demutated for inclusion in the forms index has now been extended to include:
dhá (when it means "two") and various forms of "ag a"/"ag ár", such as gha, ghá, dha, dhá, dh', gh', 'gh, 'dh, 'ghá, 'ghár, 'dhá, 'dhár, etc.
Auto-demutation is not applied to foreign matter, thus avoiding false positives for demutation in foreign words, eg.
• \Shakespeare\
• gan \phosphorous\
• \Hóm Rúl\
• iomlán na \Hanóibhéarach\
Because we regularly find foreign words — in particular, isolated foreign words — which are mutated in Gaelic matrix text, preventive demutation must be manually applied to these, for example:
• tar annseo, a \B^hrunton\
• in éadan na \^n^-Allies\
• an \^t^-Abbé\
This hat markup can be useful with other types of characters which could be described as "non-lexical" and should be ignored in tokenisation and similar processing.
Such a use is with impermanent end-of-line hyphens, ie. those which are simply artifacts of line-breaking.
(Unfortunately, such a transient hyphen will be visible when text is displayed, even when original line structure is not maintained.)
These segmentation and demutation processes are performed by Program Deisiú. The text file is first opened, and then the Saidhbhriú main menu option is selected.
Of the two submenu options, Briseadh+díchlaochlódh is appropriate for Gaelic text (whether Irish or Scottish), while Briseadh applies to other languages.
A language-dependent plain-text file containing stored lists of words is provided, to fine tune the application of the processes:
It is called ENRICH.DAT for Gaelic (Irish); SGWORD.DAT for Gaelic (Scottish); BEWORD.DAT for English; and FRWORD.DAT for French.
No such files are provided for German or Russian, and any deviation from the default (list-less) segmentation process must be performed manually.
You are now asked to locate and open this file. The process then works its way through the text, performing segmentation and demutation, and end-of-line hyphen resolution.
Some common cases of ambiguity where an interactive decision is required:
• a word is encountered which is ambiguous with regard to demutation, eg. thart; you are asked whether or not this is a case of initial mutation;
the answer is yes if the word is tart (thirst; becoming t^hart), no if the word is thart (around; remaining unchanged)
• an end-of-line hyphen is encountered; you are asked if the word has just been broken because the space on the line has run out, eg. (using / here to represent a line-break)
siubhal-/famuid, becoming siubhal^-/famuid; or whether the hyphen is a more permanent part of the word (eg. Ros-/na-rí, sean-/teach, becoming Ros+-+/na+-+rí, sean+-+/teach).
(The very rare case where the hyphen is best left without additional markup requires manual handling.)
When the process reaches the end of the text, the enriched text should be saved using the Sábháil Mar option from the Téacs menu, and the program exited.
Apart from "Téacs", "Marcáil dheireadh líne" and "Saidhbhriú", the other main menu options of Program Deisiú are considered obsolete, and are not documented here.
The "Priontáil" and "Bail Phriontála" (Print Setup) suboptions of "Téacs" have never been implemented.
II.F Word lists — Program Anailís again
After segmentation and demutation, Program Anailís may be used again, both to produce a word list and an index locorum. This time, however,
it is suggested that the three option boxes at the lower right be ticked:
This choice neutralizes case differences and removes initial mutations, and results in a much shorter wordlist and index locorum than before,
which may highlight a different type of possible error.
For example:
DÉANADH 1
DEÁNAMH 3
DEÁNTA 1
shows variability in the placement of the length mark on the first syllable. It is worth checking whether this agrees with the printed text,
and correcting the digital text if necessary. If the variation is present in the printed text, it might be judged to fall within the limits of acceptability,
and "correction" is hardly justified (in this instance).
The two example texts after the above semi-automated processing
Example 1:
<L 7>
MAC AN IASCAIRE |
Thiar sa ^tsean+-+^tsaoghal b^hí mórán de na daoine 'n+a |
^gcomhnaidhe cois cladaigh agus iad ag baint a ^mbeatha as |
cnuasach na trágha agus as iasc na fairrge. San am a |
^b^hfuil mé ag trácht air b^hí buachaill san áit seo a c^haith an |
m^hórchuid de ^n^-a s^haoghal ag iascaireacht ar an f^hairrge. Mar |
go rabh sé 'n+a b^hádóir aireach g^hlic b^hí togha agus rogha de |
ógmhná na tíre ag brath air; agus b^hí a s^hliocht air, p^hós |
sé an cailín a ba dóigheamhla sa p^haráiste. |
B^hí téagar agus carthannas eadar an lánamhain agus ní |
t^háinig aon f^hocal searbh ariamh eatorra. B^hí rath agus bláth |
ortha agus gach uile nidh acu ar a ^dtoil féin, ach amháin go |
rabh siad ar easbhaidh cloinne. B^hí siad corradh le fiche |
bliadhain pósta agus ní ^b^hfuair siad aon duine de t^heaghlach, |
agus annsin b^hain siad dúil de s^hliocht. |
Níor c^hothuigh sin mí+-+s^hásamh ar bith eatorra. B^hí siad trioll^-|
mhasach críonna, agus leag siad thart a b^heagán nó a m^hórán |
i ^gcomhnaidhe fá c^hoinne na “coise tinne.” Nó b^hí '+f^hios acu |
go ^dtiocfadh an lá a ^n^-imtheochadh an óige, agus mar dubhairt |
an sean+-+rann: |
“…mur' ^mbí sé agat féin, |
Béidh tú fannlag is claon amuigh leis an g^hréin.” |
D'+oibir an ^t^-iascaire '+f^had is b^hí sé in innimh buille rámha |
a t^harraint agus eangach a c^hur. Is iomdha oidhche anróiteach |
a c^haith sé ar an f^hairrge le cascairt agus le fuacht, le dócal |
iomartha agus le ^heasbhaidh bidh. Ach nuair a b^hí sé ró+-+aosta |
le lá oibre a d^heánamh g^hlac sé a scríste agus leig a m^haidí |
le sruth. |
T^hoisigh siad annsin a b^haint as an lón a b^hí deánta acu. |
Ach nuair a bíthear i ^gcomhnaidhe ag baint as agus gan ag |
cur in+a c^heann ní m^haireann sé i ^b^hfad. Nuair a b^hí an |
^t^-airgead ar s^héala b^heith reaithte t^háinig imnidhe ar an iascaire |
go ^mbéadh droch+-+d^hóigh ortha i ^ndeireadh a saoghail. Ní rabh |
de s^heift aige ach a g^habháil i ^gceann na ^hiascaireachta arís. |
B^hí a c^huid eangach stiallta stróctha agus b'+éigean dó a |
^gcóiriú. Nuair a b^hí deis aige ortha c^huir sé isteach sa b^hád |
iad agus d'+iomair amach go béal an c^huain. C^huir sé na |
^heangacha le luighe gréine agus t^hóg sé arís iad le bánú an |
lae, ach ní rabh dadadh ionnta ach meilearach agus slata mara. |
S^huidh sé ar na rámhaí agus t^harraing isteach ar an b^haile. |
Leis sin tchí sé chuige an long faoi s^heol agus í ag clasú |
na fairrge ar m^héad is b^hí de s^hiubhal léithe. Scannruigh an |
<L 8>
|
Example 2:
<L 1>
‡CLOICH C^HEANN+-+F^HAOLAIDH‡. |
In ainm an Ríogh cuirim tús ar an |
leabhairín seo “chum glóire Dé agus |
onóra na ‡^hÉireann‡.” |
B'+f^héidir nár b^h'+f^hearr nídh dá ^ndean^-|
fainn 'ná rud éighinteacht a c^hur síos ar |
t^hamall s^hubhailceach a c^haith sgaifte |
againn ar Árd+-+Sgoil ‡C^holumcille‡ anur^-|
aidh, i ‡^gCloich C^heann+-+F^haolaidh‡! |
I ‡^gCloich C^heann+-+F^haolaidh‡! Cá ^b^hfuil |
‡Cloich C^heann+-+F^haolaidh‡? a+deir an léightheoir. Maise, |
a léightheoir m^haith, s^haoileas nach ^mbéadh féidhm orm sin |
a innsin duit. Badh í an b^haramhail a b^hí agam nach rabh |
Gaedheal ó c^hionn go cionn na ‡^hÉireann‡ nach ^gcualaidh |
trácht ar ‡C^hloich C^heann+-+F^haolaidh‡. |
Dá ^gcuirtí ceist orm cá ^b^hfuil an áit is Gaedheal^-|
aighe i ^gCúige ‡Uladh‡, b^héarfainn mar f^hreagra nach |
^b^hfuil áit ann comh Gaedhealach leis an c^heanntar |
talaimh a+tá os coinne Oileán ‡T^horaighe‡ isteach. 'S+í an |
teanga a labhair ‡Columcille‡ i+measg sléibhte ‡T^hír |
C^honaill‡ na mílte bliadhan ó s^hoin a c^hluintear ag aosta |
agus ag óg go fóill ann. Ar altóir Dé cuireann an |
sagart i ^gcuimhne do+'n p^hobal ann, i ^dteangaidh |
‡P^hádraic‡, go ^dtiocfaidh an lá nuair a c^haithfeas cách |
cunntas cruinn a t^habhart do+'n C^hruthuightheoir mar c^haith |
sé 'ach a'n b^homaite dá s^haoghal. Má+'s ag siubhal an |
b^healaigh m^hóir ann a+tá tú, castar airde de g^hlúin de |
p^háiste ort nach ^dtuigeann acht an teanga a b^hí fá |
<L 2>
II.G Sentence alignment
Where a text has parallel versions in two or more languages, it is necessary to align them, so that
the corresponding sentences in the different versions can be located and compared. Algorithmically, the alignment units
of a text are taken to be separated by two or more spaces; these units will correspond quite closely with linguistic sentences,
and we continue to refer to them informally as "sentences", or as "units" when clarity is required. It remains to arrange that
these sequences of spaces which delimit units will correspond in all parallel versions of a text. This is accomplished by tweaking
the embedded spaces defining the unit boundaries.
It will be common to find two linguistic sentences in one language version corresponding to one sentence in another version.
In that case, either the two sentences should be separated by one space only, so that they form a single unit; or the one sentence may be split
into two by insertion of two spaces at a suitable point, making it into two units.
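How units fall out of the embedded spaces can be shown with a small Python sketch. It is an illustration only, not the code of Program Collate: it assumes that every text line already carries its | marker, that reference lines are enclosed in angle brackets, and that anything after the bar (such as a type code) can be ignored. The function name alignment_units is hypothetical.

```python
import re

REFERENCE = re.compile(r"^<.*>$")   # assumed shape of a reference line

def alignment_units(text):
    """Split marked-up text into units separated by two or more spaces."""
    pieces = []
    for line in text.splitlines():
        if not line.strip() or REFERENCE.match(line.strip()):
            continue                      # blank and reference lines carry no text
        body = line.split("|", 1)[0]      # keep the spaces before the bar
        if body.endswith("^-"):
            body = body[:-2]              # transient end-of-line hyphen: rejoin the word
        pieces.append(body)
    running = "".join(pieces)
    return [unit.strip() for unit in re.split(r" {2,}", running) if unit.strip()]
```

Two sentences separated by a single space stay in one unit, even when a blank line stands between them, because only the spaces before the bar are taken into account.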
There is a great deal of latitude in this matter. General principles include:
• units should not be so short that they are uninformative when retrieved as contexts
• units should not be so long that their usefulness to identify translation equivalents is compromised
• the Gaelic version of a text should be accorded primacy, all other things being equal
Program Collate is used to assist in the alignment process. We initially present it with two language versions of a text,
prepared independently as described hitherto, and it will display them side-by-side in vertical columns, unit by unit.
Having identified the point at which the versions get out of step, the internal spacing in one or other is adjusted manually
to maintain alignment up to the next point of divergence. Program Collate will be rerun as often as necessary, and adjustments
made manually to the text files, until they are completely synchronized.
The program interface is extremely simple. It asks, first, for the level of units to display; this will normally be 1, meaning display
by sentences, but it may be higher, up to the number of levels of structure in the particular text.
The program then asks, in turn, for the name of each file to be compared. When the requisite number of files
have been named (two, in most situations), the prompt for the next (eg. third) filename is answered by simply hitting the return key,
which is taken to mean that no more text files are involved. The program then prompts for the name of a file to receive the output, ie.
the display of the input files in parallel vertical columns.
If Program Collate freezes or fails, check the following aspects of the input files:
• ensure there are no trailing spaces on any line
• ensure that all lines, except reference lines and blank lines, contain a | character
• ensure that the last line is followed by CRLF
The incomplete output file may indicate the location of the problem.
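The three checks listed above are easy to automate before running Program Collate. The following Python sketch is one possible way to do so. It is not part of the Tobar na Gaedhilge software, the function name collate_input_problems is hypothetical, and it assumes that the file is encoded in UTF-8 and that reference lines are enclosed in angle brackets.

```python
def collate_input_problems(data):
    """Return (line_number, message) pairs for the raw bytes of an input file.

    Line number 0 is used for a problem that concerns the file as a whole.
    """
    problems = []
    text = data.decode("utf-8")          # assumed encoding
    if not text.endswith("\r\n"):
        problems.append((0, "last line is not followed by CRLF"))
    for number, line in enumerate(text.split("\r\n"), start=1):
        if line != line.rstrip(" "):
            problems.append((number, "trailing spaces"))
        stripped = line.strip()
        is_reference = stripped.startswith("<") and stripped.endswith(">")
        if stripped and not is_reference and "|" not in line:
            problems.append((number, "no | character"))
    return problems
```

An empty result means that none of the three listed conditions was found. It does not guarantee that the versions are aligned.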
This example output from Program Collate shows a German translation of "Ben Hur", in the rightmost column, being aligned
at sentence level with the Gaelic and English versions, which are already mutually aligned. The point at which the third column
loses alignment with the others shows where alteration of the internal spacing in this German text must recommence.
AONAD 196
Annsin labhair an          Then, slowly at first,     Und langsam, wie einer,
Gréagach; ach má labhair,  like one watchful of       der gewohnt ist, seine
ba go fadalach é i         himself, the Greek began   Worte zu wägen, begann
dtoiseach, mar bhéadh sé   —                          der Grieche:
ar a choimhéad féin —
AONAD 197
“An sgéal seo atá liom, a  “What I have to tell, my   „Was ich euch zu erzählen
bhráithre, tá sé comh      brethren, is so strange    habe, meine Brüder, ist
hiongantach agus nach mó   that I hardly know where   so seltsam, daß ich kaum
ná go bhfuil a fhios agam  to begin or what I may     weiß, wo ich beginnen,
cá dtoiseóchaidh mé, nó    with propriety speak.      wie ich mich richtig
goidé is cóir damh a                                  ausdrücken soll.
rádh.
AONAD 198
Is ar éigean a thuigim     I do not yet understand    Ich verstehe mich selbst
féin fós é.                myself.                    noch nicht, nur dessen
                                                      bin ich gewiß, daß ich
                                                      den Willen eines Höheren
                                                      vollziehe, dem zu dienen
                                                      beständiges Entzücken
                                                      ist.
AONAD 199
Níl mé cinnte de nídh ar   The most I am sure of is   Wenn ich der Aufgabe
bith ach go bhfuil mé ag   that I am doing a          gedenke, die zu
déanamh Tola Maighistir,   Master's will, and that    vollbringen ich gesandt
agus gurab áthas liom i    the service is a constant  bin, so überkommt mich
dtólamh é.                 ecstasy.                   eine so unaussprechliche
                                                      Freude, daß ich nicht
                                                      zweifeln kann: es ist der
                                                      Wille Gottes, den ich
                                                      erfülle.“
AONAD 200
Nuair a smaointighim ar    When I think of the        Von Bewegung übermannt,
an rún atá leigthe liom,   purpose I am sent to       hielt der Redende inne
bíonn lúthgháir orm nach   fulfill, there is in me a  und die anderen senkten
bhfuil innse uirthi,       joy so inexpressible that  im gleichen Gefühle die
lúthgháir a bheir le fios  I know the will is         Blicke.
damh go bhfuil Toil Dé     God's.”
liom.”
Program Collate can in fact display, in parallel vertical columns, up to ten variants of a text.
The embedded spaces in a text may be adjusted, as described, to impose a parallel hierarchical structure on language-versions of the same text, while independently,
each individual language-version can (if desired) preserve its own physical structure of paragraphs and larger units, by means of blank lines.
Blank lines — totally blank lines — may be inserted or deleted at will and will not disturb the parallelism, which depends only on horizontal sequences
of embedded spaces. Each system — blank lines and embedded spaces — can serve a different purpose, and there is no longer any redundancy between them.
The parallel hierarchical structure follows, as closely as practicable, the physical structure of the Gaelic version of the text.
A situation which is surprisingly common may be used as an example. A sentence in one version of a text may be split
over two paragraphs in another version. One version may be encoded as:
The first bit, the rest … |
while the second version may be:
The first bit. |
The rest … |
In the latter extract, the blank lines show the paragraph layout of this version, while the embedded spaces — only one at the end of the first paragraph —
allow both pieces of text to be treated as part of the same sentence.
II.H Lemmatization (non-Gaelic texts)
A standard way of enhancing digitised text is the automated assignment of lemmas and part-of-speech tags (POStags) to words.
For texts in English, French, German and Russian, we use the TreeTagger software and its associated language models.
This is a statistical procedure and the unavoidable error rate is reduced by manual post-editing.
TreeTagger is run through the Windows interface;
options files for the different languages are found in english.cfg, french.cfg, german.cfg and russian.cfg.
These options all use Program Tokenizer as an "own tokenizer" programme, to convert our normal text files to the form required by TreeTagger.
Note that german.cfg uses an auxiliary lexicon german_auxlex.txt.
We will illustrate the process on a section of unlemmatised English text:
THE connection between the Irish people and the French |
Revolution is part of the general connection between |
Ireland and France which forms a most important factor |
in the history of Western Europe. |
This shows the Windows TreeTagger interface program, with the English options file english.cfg loaded. Change the input and output filenames to those required,
and click button Run.
If the Tokenizer window appears, do not change any fields, but simply click the Run button; when that button changes to Exit, click it again.
Respond:
• to the TreeTagger prompt "Press any key to continue", by hitting the enter key
• to the "TreeTagger finished" window by clicking the "OK" button
• to the TreeTagger Windows interface main window by clicking the "Exit" button
Here is a section of the output file (the spaces showing between the words are actually TAB characters):
THE DT the
connection NN connection
between IN between
the DT the
Irish JJ Irish
people NNS people
and CC and
the DT the
French JJ French
Revolution NN revolution
is VBZ be
part NN part
of IN of
the DT the
general JJ general
connection NN connection
between IN between
Ireland NP Ireland
and CC and
France NP France
For Russian texts, the lemmatisation can be improved by passing this TreeTagger output through Sharoff's lemmatization tool,
which requires the programming language Perl to be installed, and the CSTLemma application.
This process improves the lemmas assigned by TreeTagger, but does not alter the POStags.
To do this, download the CSTLemma Windows executable; and download Sharoff's Russian lemmatization tool, extracting all three files contained in it, but
• replacing lemmatiser.pl by our version, for compatibility with CSTLemma 7.0; the "use lib" line may need to be changed, depending on the location of smallutils on your machine
• downloading smallutils.pm if it is needed
• extracting msd-ru-lemma.lex from msd-ru-lemma.lex.gz
• adding our runsharoff.bat, in which the names of the input and output files should be inserted; ensure that cstlemma= is followed by the name of the downloaded Windows executable. Then run runsharoff.bat.
The improved and unimproved lemmatised files should now be merged, after removing the "<guessed>" column from the improved file.
Where the files differ, the improved lemma is generally the better choice, but watch out for proper names, where the improved lemma will often be lower-cased,
and the unimproved lemma may be preferable (change the POStag from Nc to Np at the same time).
Watch out also for wordforms containing "ë", as TreeTagger and Sharoff's tool may only cater for the accentless forms of many such words.
Reference to Russian morphology websites such as morfologija may be useful.
The TreeTagger output (optionally after the extra step just described in the case of Russian) is now reformatted using the regular edit:
Replace ^(%*^)^t^(*^)^t^(*^)$ by ^1«^3/^2»
giving
THE«the/DT»
connection«connection/NN»
between«between/IN»
the«the/DT»
Irish«Irish/JJ»
people«people/NNS»
and«and/CC»
the«the/DT»
French«French/JJ»
Revolution«revolution/NN»
is«be/VBZ»
part«part/NN»
of«of/IN»
the«the/DT»
general«general/JJ»
connection«connection/NN»
between«between/IN»
Ireland«Ireland/NP»
and«and/CC»
France«France/NP»
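The regular edit above is written in the syntax of one particular editor. As a rough equivalent, and only as a sketch, the same reformatting could be done in Python. The function name reformat_line is our own invention, and it is assumed that each TreeTagger output line holds exactly three TAB-separated fields (token, POStag, lemma); lines of any other shape are passed through unchanged.

```python
import re

# Three TAB-separated fields: token, POStag, lemma (assumed layout).
TREETAGGER_LINE = re.compile(r"^([^\t]*)\t([^\t]*)\t([^\t]*)$")


def reformat_line(line):
    """Turn 'token<TAB>POStag<TAB>lemma' into 'token«lemma/POStag»'."""
    match = TREETAGGER_LINE.match(line)
    if match is None:
        # Not a three-field line: leave it for manual inspection.
        return line
    token, postag, lemma = match.groups()
    return f"{token}«{lemma}/{postag}»"


if __name__ == "__main__":
    for sample in ["THE\tDT\tthe", "is\tVBZ\tbe"]:
        print(reformat_line(sample))
```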
The result of lemmatizing a text in one of the aforementioned languages is that each word is accompanied by a lemma and a POStag, in the format
token«lemma/POStag»
The set of POStags is determined by the model used by TreeTagger for the particular language. For English, for example, VBZ means "verb be, present tense 3s".
This is a convenient point at which to review the lemmatisation manually, and there will certainly be a lot which can be corrected, although much of it may not matter
for our purposes. For POStags, for example, we only require a coarse-grained classification (V, N, A, Z), and errors in finer details will have no effect on us.
Only a few points can be mentioned here; many more will be suggested by referring to our lemmatized texts.
Most obviously, erroneous lemmas or POStags need correction, but there are also some systematic changes to be made.
Numbers are assigned a lemma "@card@" by TreeTagger, but we replace these by the actual form. We assign lower-case lemmas to Roman numerals, eg. "xiv".
The period after an abbreviation is included in the lemma, eg. Mr«Mr./NN». Words in non-standard spellings (eg. antique, dialect) are unlikely to have been correctly handled.
Decapitalization of the lemmas assigned to proper names is a problem with several languages.
Lemmatisation of French or German results in multiple-choice lemmas, like "fil|fils" or "plaire|pleuvoir" or "Wind|Winde" and these must be resolved in each case.
In English, tokens of "won't", "can't", "shan't" and "ain't" must, at some stage, be assigned a sequence of two lemmas each, eg. won't«will/MD not/RB».
In Russian, words assigned a POStag of "-" should be manually re-assigned (except for punctuation marks, as the POStag assigned to these will be dropped and does not matter).
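Some of the systematic points above lend themselves to a mechanical first pass before the manual review. The following Python sketch is not one of our supplied programs; the function name review_entry and the wording of its notes are hypothetical. It replaces a lemma of "@card@" by the actual form, and flags multiple-choice lemmas and POStags of "-" (other than on punctuation) for manual attention. It does not attempt the other corrections mentioned, such as Roman numerals or proper names.

```python
import re

# One reformatted line: token«lemma/POStag»
ENTRY = re.compile(r"^(.*)«(.*)/([^/»]*)»$")


def review_entry(line):
    """Return (possibly amended line, list of notes for the reviewer)."""
    match = ENTRY.match(line)
    if match is None:
        return line, ["not in token«lemma/POStag» form"]
    token, lemma, postag = match.groups()
    notes = []
    if lemma == "@card@":
        # Numbers: replace TreeTagger's placeholder by the actual form.
        lemma = token
    if "|" in lemma:
        # French or German alternatives such as fil|fils must be resolved.
        notes.append("multiple-choice lemma")
    if postag == "-" and any(ch.isalnum() for ch in token):
        # Russian words tagged "-" need a POStag; punctuation does not.
        notes.append("POStag '-' needs manual assignment")
    return f"{token}«{lemma}/{postag}»", notes
```

The POStags used when trying this out (for example CD or NOM) depend on the TreeTagger model for the language concerned.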
When this review is completed, Program Combine is run to produce the lemmatised text, by merging this reformatted TreeTagger output with the original unlemmatised text.
Program Combine, shown here, requires the names of the two input files and of the output file; once these have been given, click the "Run" button. A part of the resulting lemmatised file is shown here:
THE«the/DT» connection«connection/NN» between«between/IN» the«the/DT» Irish«Irish/JJ» people«people/NNS» and«and/CC» the«the/DT» French«French/JJ» |
Revolution«revolution/NN» is«be/VBZ» part«part/NN» of«of/IN» the«the/DT» general«general/JJ» connection«connection/NN» between«between/IN» |
Ireland«Ireland/NP» and«and/CC» France«France/NP» which«which/WDT» forms«form/VVZ» a«a/DT» most«most/RBS» important«important/JJ» factor«factor/NN» |
in«in/IN» the«the/DT» history«history/NN» of«of/IN» Western«Western/NP» Europe«Europe/NP». |
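The merging step can be illustrated by a short Python sketch. This is not the source of Program Combine, whose internal workings are not described here; the function name combine and the matching strategy are assumptions. Each annotated token is located in the original text in turn, and its «lemma/POStag» annotation is inserted after it. Tokens containing no letter or digit are treated as punctuation and left unannotated.

```python
import re

# One reformatted line: token«annotation»
ENTRY = re.compile(r"^(.*)«(.*)»$")


def combine(original_text, entries):
    """Insert each entry's annotation after its token in original_text."""
    pieces = []
    position = 0
    for entry in entries:
        match = ENTRY.match(entry)
        if match is None:
            continue  # skip lines which are not annotated tokens
        token, annotation = match.groups()
        start = original_text.find(token, position)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {position}")
        end = start + len(token)
        # Copy the text up to and including the token.
        pieces.append(original_text[position:end])
        if any(ch.isalnum() for ch in token):
            pieces.append(f"«{annotation}»")
        position = end
    # Copy whatever follows the last token (eg. the line-end marker).
    pieces.append(original_text[position:])
    return "".join(pieces)
```

The search is deliberately naive: it relies on the two inputs being in step, and a mismatch between them is reported rather than repaired.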
II.I Indexing and retrieval
Applications may be written to analyse the texts prepared as described above. Program Anailís and Program Collate, for example, previously mentioned as aids to
text preparation, can also usefully be applied to finished texts, as can Program Ainmneacha.
However, the main supplied application is Tobar na Gaedhilge, which consists of two programs:
• an indexing program; and
• a retrieval program, to fetch material from the texts, using the indexes.
These programs will now be presented.
For indexing, the Gaelic texts are divided into collections, each corresponding to a range of continuity dialects:
• Ulaidh (Ulster)
• Connachta (Connacht)
• An Mhumhain (Munster)
• Oirthear (East of Ireland)
• Alba (Scotland)
It is probable that Scotland could be further subdivided.
Within each collection, texts are grouped by author; or in the case of East of Ireland, by region.
Non-Gaelic texts follow the pattern of their Gaelic parallels. As it happens, only texts in our Ulster collection
have non-Gaelic parallels; within that collection, the parallels (like the Gaelic versions) are grouped by the author
or translator of the Gaelic version.
For each Gaelic text collection, a word-form index is created.
For the Ulster collection, for each non-Gaelic language, two indexes are created: an index of word-forms, and an index of lemmas.
A lemma index should exclude foreign words, ie. words not in the matrix language of the version.
A word-form index should include all words, whether native or foreign.
Lemma indexes to the Gaelic text collections are under active consideration.
Collection    Language   Wordform index   Lemma index
Ulaidh        Gaelic     yes              not yet
              English    yes              yes
              French     yes              yes
              German     yes              yes
              Russian    yes              yes
Connachta     Gaelic     yes              not yet
An Mhumhain   Gaelic     yes              not yet
Oirthear      Gaelic     yes              not yet
Alba          Gaelic     yes              not yet
The indexing program
Program Setup can perform several tasks on the prepared texts, but the one which concerns us here is to create an index.
For each index to be created, there is a data file, containing the instructions to the program:
• setupa.dat (Alba, Gaelic, wordforms), creates index ALBA
• setupc.dat (Connacht, Gaelic, wordforms), creates index CONNACHTA
• setupm.dat (An Mhumhain, Gaelic, wordforms), creates index MUMHAIN
• setupo.dat (Oirthear, Gaelic, wordforms), creates index OIRTHEAR
• setupu.dat (Ulaidh, Gaelic, wordforms), creates index ULAIDH
• setupu_b.dat (Ulaidh, English, wordforms), creates index ULAIDH_EN; setupu_b_lem.dat (Ulaidh, English, lemmas), creates index ULAIDH_EN_L
• setupu_d.dat (Ulaidh, German, wordforms), creates index ULAIDH_DE; setupu_d_lem.dat (Ulaidh, German, lemmas), creates index ULAIDH_DE_L
• setupu_f.dat (Ulaidh, French, wordforms), creates index ULAIDH_FR; setupu_f_lem.dat (Ulaidh, French, lemmas), creates index ULAIDH_FR_L
• setupu_r.dat (Ulaidh, Russian, wordforms), creates index ULAIDH_RU; setupu_r_lem.dat (Ulaidh, Russian, lemmas), creates index ULAIDH_RU_L
Each index created consists of four files, each named after the index and followed by one of the extensions .tex, .wor, .hea and .ent.
Program Setup first invites the user to browse to the directory where the data files reside and where the index is to be created.
The program then asks which of the data files to use. (Click "Cancel" to input the data at the keyboard, instead of from a file.)
On choosing a data file and clicking on "Open", the data file is processed, with a rolling commentary displayed.
A successful program run will create or recreate the four index files for the index specified in the data file.
The user is given the opportunity to save the commentary to a file, which may be useful to gather statistics, or to diagnose any error which may have occurred.
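After a run, it may be convenient to confirm that all four index files are present. The following Python sketch is not part of Program Setup; the function names are hypothetical, and it is assumed that each file name is simply the index name followed directly by the extension, in the letter case given.

```python
from pathlib import Path

# The four extensions which make up one index.
INDEX_EXTENSIONS = (".tex", ".wor", ".hea", ".ent")


def index_files(directory, index_name):
    """List the four file paths expected for one index."""
    return [Path(directory) / (index_name + ext) for ext in INDEX_EXTENSIONS]


def missing_index_files(directory, index_name):
    """Return those expected index files which do not exist."""
    return [path for path in index_files(directory, index_name) if not path.exists()]


if __name__ == "__main__":
    # Example: check the English wordform index in the current directory.
    for path in missing_index_files(".", "ULAIDH_EN"):
        print("missing:", path)
```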
Important! Program Setup uses the last character of the name of a text file to identify the language of the text, as follows:
Last character       Language   Example
b                    English    au018b.txt
f                    French     au018f.txt
d                    German     au018d.txt
r                    Russian    au018r.txt
g or anything else   Gaelic     au018.txt
It is important that text file naming adheres to this convention, so that texts are placed in the correct indexes.
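The naming convention can be expressed as a small Python function, which may be useful for checking a directory of texts before indexing. This is a sketch only, not the code of Program Setup; the function name text_language is our own, and it is assumed (as the examples suggest) that the "last character of the name" means the last character before the extension, compared case-sensitively.

```python
from pathlib import PurePath

# Final character of the file name (before the extension) -> language.
LANGUAGE_BY_LAST_CHARACTER = {
    "b": "English",
    "f": "French",
    "d": "German",
    "r": "Russian",
}


def text_language(filename):
    """Return the language implied by a text file's name."""
    stem = PurePath(filename).stem
    # "g or anything else" means Gaelic.
    return LANGUAGE_BY_LAST_CHARACTER.get(stem[-1:], "Gaelic")


if __name__ == "__main__":
    for name in ["au018.txt", "au018b.txt", "au018r.txt"]:
        print(name, text_language(name))
```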
The retrieval program
Program Tobar is the supplied retrieval program, acting on the indexes created by Program Setup.
To download, install and use Program Tobar, see the webpage.
The HelpNDoc help project file is tobar.hnd. This file has not been updated since version 1.4.
The distributable archive is constructed using Inno Setup 5; the Inno project script file is tobar2018.iss.
II.J Future Lemmatisation of Gaelic texts
For our Gaelic texts, the development of a TreeTagger-type language model is impracticable due to the high degree of spelling variability.
In any case, POStags are not what will best serve to disambiguate words, but word-senses; and there is no readily and freely available system of
automated word-sense disambiguation, even for English.
We therefore approach the Gaelic texts empirically, and are in the process of developing our own numbered list of senses for each Gaelic word-form.
For forms which are included in FGB, the numbering of the senses begins with the senses in FGB; eg. ár«1» for "ár_destruction". The numbering is extended
to include, in random order, senses found here which are not present in FGB. Forms not in FGB have their senses randomly numbered from 1.
Numerous senses in FGB may not occur here, and will give rise to gaps in our numbering.
Our tokenisation has resulted in demutated, highly-atomistic tokens, tolerant of a wide range of spelling variation, and we must now label each token with
the appropriate sense number. As an extreme example of a decomposed lemmatization, we may take gurab = gur«1»+a«17»+b«2».
Up to now, sense-numbers have been attached to tokens manually and sporadically as the opportunity arises, but the most pressing task at this moment is to make
a concentrated effort at annotating texts with sense-numbers, supported by data files and programs.
A file, forms.txt already exists in a preliminary form, in which the numbered senses of forms are listed. For each sense, we record a hint to indicate the meaning;
the standard Irish form; and the FGB lemma. Then we give a favoured Ulster spelling, and a favoured Ulster lemmatisation. Other fields may be added, eg. a POStag.
This file needs to be put into a computational form, and programs need to be written to maintain it and to apply it to text.
We envisage a program to attach sense numbers to textual tokens, using the forms.txt file, and resolving ambiguous forms manually but interactively.
Eventually a more automated resolution may emerge, based on a tree-tagger-like statistical approach.
The assigned sense number will also determine the lemma (and the POStag), and a relatively simple conversion program, drawing on the forms.txt file, would transform
the text annotated with sense numbers to a form containing POStags and orthographic lemmas (like the lemmatised texts in non-Gaelic languages), ready for input to the Tobar indexing program.
The texts with the numerical lemmas might, however, continue to function as the master copy since they are more flexible, when used in combination with the forms.txt file
and the conversion program.
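To make the projected conversion concrete, here is a minimal Python sketch. It is purely illustrative: the conversion program does not yet exist, the computational form of forms.txt is undecided, and the table FORMS below (with its single entry and its POStag "N") is a hypothetical stand-in for whatever that file will eventually provide. Each form«sense» part of a token is looked up by form and sense number, and rewritten as form«lemma/POStag»; parts with no entry in the table are left untouched.

```python
import re

# One part of a token: form«sense number», eg. ár«1»
PART = re.compile(r"([^«»+\s]+)«([^»/]+)»")

# Hypothetical stand-in for forms.txt: (form, sense) -> (lemma, POStag).
FORMS = {
    ("ár", "1"): ("ár", "N"),  # illustrative entry for "ár_destruction"
}


def convert_token(token, forms=FORMS):
    """Rewrite form«sense» parts as form«lemma/POStag» where known."""

    def replace(match):
        form, sense = match.groups()
        entry = forms.get((form, sense))
        if entry is None:
            return match.group(0)  # unknown sense: keep the numerical lemma
        lemma, postag = entry
        return f"{form}«{lemma}/{postag}»"

    return PART.sub(replace, token)
```

Because the sense-numbered text is left unchanged by this step, it can continue to serve as the master copy from which the converted form is regenerated.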
In due course, these projected developments are intended to produce a completely-disambiguated lemmatised Gaelic index, incorporated into Tobar na Gaedhilge.
The way is then open for major advances, such as the calculation of lemma-to-lemma translation equivalents. A lemmatised Gaelic index will also relieve the Gaelic forms index
of the burden which it carries from having to support some of the functions of a lemmatised index. For example, the degree of segmentation utilised in making the forms index
might be reduced, as might the number of textual corrections; both these factors will mean that the items in the forms index should reflect more accurately the tokenization of the author.
Looking even further ahead, when the texts bearing numerical lemmatization are made available directly for reading by a user, whether on a local computer or on the internet,
they could be offered for viewing in a variety of user-selectable orthographies, generated on-the-fly through the use of the forms file; such orthographic choices might include,
for example: dot vs h lenition; Latin vs Gaelic font; original spelling vs standard Irish vs standardised Ulster Gaelic vs other experimental spellings.
Finally we mention some other possible and desirable future developments:
• it is extremely important to examine the manuscripts, where available, and to reflect their content in our text files;
• sentence-level sound files, to provide aural enhancement of retrieved sentences;
• syntactic analysis of our digitised texts; our identification and lemmatisation of null tokens (completely elided words) is expected to play a vital role in this.
II.K Name lists (Gaelic texts only)
Program Ainmneacha makes a list of names from a text marked up appropriately.
The names are displayed with any marked-up corrections effected.
If the first element of a name carried initial mutation, that mutation is removed.
Initial mutations of non-initial elements are retained, and terminal inflections of all elements are retained.
This is liable to produce ungrammatical results on occasion, but seems to be the best that can be automated.
The list may be produced in alphabetic order, or in text order.
Text order may be used to verify that the name delimiters have been correctly paired.
Alphabetic order is best for studying the names occurring in a text.
II.L Lists of foreign matter
Program Foreign makes a list of items marked as foreign from a text marked up appropriately.
The items are displayed with any marked-up corrections effected.
The list may be produced in alphabetic order, or in text order.
Text order may be used to verify that the foreign delimiters have been correctly paired.
Alphabetic order is best for studying the foreign items occurring in a text.
Summary of markup which is inconsistent, unstable or untested
Here we summarise those aspects of our markup which are not used by current applications,
and as a result are not consistently implemented or have not been tested.
Such aspects include:
• reference categories other than <L>
• consistency and alignment of text units at higher level than paragraph — being actively addressed
• type codes added after the vertical bar on a line, to classify units ending on that line
• consistency in use of en-dash
• correction from the manuscript
• abolish legacy prefix characters: # and $
• the representation of small caps, bold, italic, or enlarged type by @ or by adding quotation marks
Markup which is likely to be revised:
• lemmatise Gaelic texts
• reduce quantity of text corrections as lemmatisation becomes more tolerant of phonetically-valid spelling variation
• extend name marking to non-Gaelic texts
• make foreign marking consistent, and extend it to include foreign names in non-Ulaidh and non-Gaelic texts
• non-marking as names of individual buildings, streets, etc.
• reconsideration of marking of word-joining by underscore
• non-inclusion of generic or lexical items in name marking
Summary list of programs
Program Anailís
Program Deisiú
Program Collate
Program WinTreeTagger
Program WinTrainTreeTagger
Program Tokenizer
Program Combine
Program Udarjenie
Program Setup
Program Tobar
Program Ainmneacha
Program Foreign
Conversion to XML, some comments
Apologies, this section is not well thought out, or comprehensive.
The tokens of a text are separated by a space character. A token may have up to four elements: type, lemma, pos, footnote_reference. To take an English example:
service«service/NN»‰1
might become
<w lemma="service" pos="NN" footnote="1">service</w>
In Gaelic texts, as of July 2021, the lemma consists only of a quasi-numerical code, and there is no separate pos, thus:
an«1»
might become
<w lemma="1">an</w>
(For the type "an", lemma 1 is the definite article, lemma 2 is the interrogative particle, lemma 3 is the interrogative copula.)
A document listing these lemma codes is still to be prepared. The codes used in these examples should not be taken as final.
It is intended in the future to map these lemma codes to lemma/pos pairs, and thus bring the Gaelic texts into line with the other languages.
A space-delimited token may in fact be split into several tokens, eg.
d'«4»+ól«1»
might become
<w lemma="4">d'</w><w lemma="1">ól</w>
The important point is that such tokens should NOT be separated by white space whenever the text is displayed.
Characters arising from mutations (and several other similar situations) are marked as padding (in the concordancing sense of that term).
b^hean«1»
might become
<w lemma="1">b<pad>h</pad>ean</w>
Padding characters should be ignored when appropriate, eg. in analysing the text, but not in displaying it.
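The three mappings suggested so far (simple token, split token, and padding) can be drawn together in a short Python sketch. Like the rest of this section it is tentative: the function names are our own, and it is assumed that a caret marks exactly the one following character as padding, and that a footnote reference is a number following the ‰ sign.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# One part of a space-delimited token: form«annotation», optional ‰footnote
PART = re.compile(r"([^«»+\s]+)«([^»]*)»(?:‰(\d+))?")


def form_to_xml(form):
    """Escape a word-form, wrapping each caret-marked character in <pad>."""
    pieces = []
    i = 0
    while i < len(form):
        if form[i] == "^" and i + 1 < len(form):
            pieces.append("<pad>" + escape(form[i + 1]) + "</pad>")
            i += 2
        else:
            pieces.append(escape(form[i]))
            i += 1
    return "".join(pieces)


def token_to_xml(token):
    """Convert one space-delimited token to a run of <w> elements."""
    elements = []
    for form, annotation, footnote in PART.findall(token):
        # Gaelic annotations have no "/", so pos is then empty.
        lemma, _, pos = annotation.partition("/")
        attributes = " lemma=" + quoteattr(lemma)
        if pos:
            attributes += " pos=" + quoteattr(pos)
        if footnote:
            attributes += " footnote=" + quoteattr(footnote)
        elements.append("<w" + attributes + ">" + form_to_xml(form) + "</w>")
    # No white space between the parts of a split token.
    return "".join(elements)
```

Annotations holding a sequence of two lemmas, such as won't«will/MD not/RB», are not catered for in this sketch.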
Insertions and deletions:
{amh}[mha]rán
might become
<in amh /><out mha />rán
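Whatever XML representation is finally chosen, it must allow both readings of a corrected word to be recovered. The following Python sketch shows the two readings being derived directly from the present markup; the function names are hypothetical, and it assumes (in line with the example above) that braces enclose inserted characters, that square brackets enclose deleted characters, and that these delimiters are not nested.

```python
import re

INSERTED = re.compile(r"\{([^}]*)\}")  # {…} characters added by correction
DELETED = re.compile(r"\[([^\]]*)\]")  # […] characters removed by correction


def corrected_reading(text):
    """Apply the corrections: keep insertions, drop deletions."""
    return INSERTED.sub(r"\1", DELETED.sub("", text))


def original_reading(text):
    """Undo the corrections: drop insertions, keep deletions."""
    return DELETED.sub(r"\1", INSERTED.sub("", text))


if __name__ == "__main__":
    word = "{amh}[mha]rán"
    print(original_reading(word), "->", corrected_reading(word))
```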
Ciarán Ó Duibhín
Updated 2021/10/29