APERTIUM-KAZ: A MORPHOLOGICAL TRANSDUCER AND DISAMBIGUATOR FOR KAZAKH
1 Extending apertium-kaz
1.1 Stems and categories
1.2 Lexicons

WARNING: this is an early draft.

What follows is the documentation for apertium-kaz, a morphological transducer and disambiguator for Kazakh. The first draft of this documentation was written, or rather assembled from various writings on Apertium’s wiki and then extended with more details, by selimcan in September-October 2018 for members of the ‘Deep Learning for Sequential Models in Natural Language Processing with Applications to Kazakh’ (dlsmnlpak) research group at Nazarbayev University and elsewhere. That said, I hope that it will be useful for anyone who uses apertium-kaz and wants or needs to extend it with more stems or other features. Most of what is said in this guide should also apply to Apertium’s transducers for other Turkic languages.

Apertium-kaz is a morphological transducer and disambiguator for Kazakh, currently under development. It is intended to be compatible with transducers for other Turkic languages, so that text can be translated between them. It is used in the following translators (at various stages of development):

1 Extending apertium-kaz

1.1 Stems and categories

To extend apertium-kaz with new words, we need to know their lemmas and their categories. Below we list the possible categories of words. We ignore the so-called closed-class words here, as they are very unlikely to appear among unrecognized words at this stage, and we purposefully simplify some of the categories of open-class words.

| Category | Comment | Examples (from apertium-kaz.kaz.lexc file) |
|---|---|---|
| Nouns | | |
| N1 | common nouns | алма:алма N1 ; ! "apple" |
| | | жылқы:жылқы N1 ; ! "horse" |
| N5 | nouns which are loanwords from Russian (and therefore potentially with exceptions in phonology) | артист:артист N5 ; ! "" |
| | | баррель:баррель N5 ; ! "" |
| N6 | Linking nouns like акт, субъект, эффект to N6 forces apertium-kaz to analyse both акт and акті as noun, nominative, and both актты and актіні as noun, accusative, etc. The latter forms are the default: акті and актіні are generated for акт<n><nom> and акт<n><acc>, respectively, if apertium-kaz is used as a morphological generator. | |
| N1-ABBR | abbreviated nouns | ДНҚ:ДНҚ%{а%} N1-ABBR ; ! "DNA" |
| | | млн:млн%{а%}%{з%} N1-ABBR ; ! "million" |
| | | млрд:млрд%{а%}%{с%} N1-ABBR ; ! "billion" |
| | | км:км%{э%}%{з%} N1-ABBR ; ! "km" |
| Verbs | | |
| V-TV | transitive verbs | |
| V-IV | intransitive verbs | |
| | If the verb can take a direct object with -НЫ, then it is TV; otherwise it is IV. | |
| Proper nouns | | |
| NP-ANT-F | feminine anthroponyms | Сәмиға |
| NP-ANT-M | masculine anthroponyms | Чыңғыз |
| NP-COG-OB | family names ending with -ов or -ев | Мусаев |
| NP-COG-IN | family names ending with -ин | Нуруллин |
| NP-COG-M | family names not ending with -ов, -ев or -ин; masculine | Галицкий |
| NP-COG-F | family names not ending with -ов, -ев or -ин; feminine | Толстая |
| NP-COG-MF | family names not ending with -ов, -ев or -ин which can be both masculine and feminine | Гайдар |
| NP-PAT-VICH | patronymics ending with -вич (and which can thus also take the -вна ending) | Васильевич:Василье NP-PAT-VICH ; ! "" |
| NP-TOP | toponyms (river names should go here too) | Берлин |
| NP-ORG | organization names | Қазпошта |
| NP-ORG-LAT | organization names written in Latin characters | Microsoft |
| NP-AL | proper names not belonging to one of the above NP-* classes | Протон-М |
| A1 | adjectives which can modify both nouns (жақсы адам) and verbs (жақсы оқиды) | |
| A2 | all other adjectives | көктемгі |
| ADV | adverbs | әбден |

If you want to add an adverb, first think whether the word is really an adjective that can be used like an adverb. If this is the case, then add it as an A1 adjective.
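
The full entries among the examples above share one shape: lemma, colon, stem, the category, a semicolon, and an optional gloss after `!`. As a minimal sketch of that shape only, the hypothetical helper `format_lexc_entry` below (not part of apertium-kaz) assembles such a line. It assumes the stem equals the lemma unless one is given, and it does not escape special characters, which real lexc entries may require.

```python
def format_lexc_entry(lemma, category, gloss="", stem=None):
    """Build one lexc-style line: lemma:stem CATEGORY ; ! "gloss".

    Hypothetical helper for illustration; it performs no lexc escaping.
    """
    if stem is None:
        # Assumption: most entries repeat the lemma as the stem.
        stem = lemma
    return f'{lemma}:{stem} {category} ; ! "{gloss}"'


# A common noun, and a patronymic whose stem differs from its lemma.
print(format_lexc_entry("алма", "N1", "apple"))
print(format_lexc_entry("Васильевич", "NP-PAT-VICH", stem="Василье"))
```

Choosing the category is still a manual decision; the helper only keeps the punctuation of the line consistent.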

Figuring out the lemma of an unrecognized word should be straightforward. Lemmas are what you would expect to see in print dictionaries of Kazakh, except for verbs: verb lemmas in apertium-kaz are 2nd person singular imperative forms such as бар, кел, ал (not бару, келу, алу as in print dictionaries).

Still, there are some things to keep in mind (we use the words “stem” and “lemma” interchangeably below):

1.2 Lexicons

At the end of apertium-kaz.kaz.lexc, there are five lexicons: Common, Hardcoded, Abbreviations, Punctuation and Proper.

In each lexicon, entries are sorted alphabetically with the LC_ALL=kk_KZ.utf8 sort command.
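
For checking the order of a handful of new entries outside the shell, the same collation can be approximated in Python. This is a sketch, not the project's own tooling: `sort_entries` is a hypothetical name, it assumes the kk_KZ.utf8 locale is installed on the system, and its ordering may differ in edge cases from what the `sort` command produces.

```python
import locale


def sort_entries(entries, locale_name="kk_KZ.utf8"):
    """Return entries sorted by the collation rules of locale_name."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember the current setting
    try:
        # Raises locale.Error if the locale is not installed.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(entries, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # always restore


entries = [
    'жылқы:жылқы N1 ; ! "horse"',
    'алма:алма N1 ; ! "apple"',
    'артист:артист N5 ; ! ""',
]
try:
    for line in sort_entries(entries):
        print(line)
except locale.Error:
    print("locale kk_KZ.utf8 is not available on this system")
```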

These five lexicons are where you have to put new words, after you have figured out their stems and categories following the guidelines above.

The Abbreviations and Punctuation lexicons should be self-explanatory.

Any stem linked to a lexicon whose name starts with NP should be placed into LEXICON Proper.

Any (temporary) entry which involves tags, e.g.

қыл%<v%>%<tv%>%<gna_perf%>:ғып # ; ! "same as қып"

belongs to the Hardcoded section.

The rest of the stems go to LEXICON Common.
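
The placement rules above can be sketched as a small function. `choose_lexicon` is a hypothetical helper, not part of apertium-kaz. It takes a full entry line, treats `%<` as the sign that the entry involves tags, and reads the category as the last field before the semicolon. Sending categories that end in -ABBR to Abbreviations is an assumption on my part; punctuation is not covered.

```python
def choose_lexicon(entry):
    """Suggest the lexicon for one lexc entry line (illustrative sketch)."""
    body = entry.split("!", 1)[0]  # drop the trailing comment, if any
    if "%<" in body:
        # Entries that involve tags belong to the Hardcoded section.
        return "Hardcoded"
    fields = body.split(";")[0].split()
    category = fields[-1] if fields else ""
    if category.startswith("NP"):
        return "Proper"
    if category.endswith("-ABBR"):
        # Assumption: abbreviated nouns go to the Abbreviations lexicon.
        return "Abbreviations"
    return "Common"


for line in [
    'алма:алма N1 ; ! "apple"',
    'Васильевич:Василье NP-PAT-VICH ; ! ""',
    'қыл%<v%>%<tv%>%<gna_perf%>:ғып # ; ! "same as қып"',
]:
    print(choose_lexicon(line), "<-", line)
```

The tag check runs first so that an entry involving tags is routed to Hardcoded regardless of its category.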