APERTIUM-KAZ: A MORPHOLOGICAL TRANSDUCER AND DISAMBIGUATOR FOR KAZAKH
WARNING: this is an early draft.
What follows is the documentation for apertium-kaz, a morphological transducer and disambiguator for Kazakh. The first draft was assembled from various writings on Apertium’s wiki and then extended with more details by selimcan in September-October 2018 for members of the ‘Deep Learning for Sequential Models in Natural Language Processing with Applications to Kazakh’ (dlsmnlpak) research group at Nazarbayev University and elsewhere. That said, I hope it will be useful to anyone who uses apertium-kaz and wants or needs to extend it with more stems or other features. Most of what is said in this guide should also apply to Apertium’s transducers for other Turkic languages.
Apertium-kaz is a morphological transducer and disambiguator for Kazakh, currently under development. It is intended to be compatible with transducers for other Turkic languages so that text can be translated between these languages. It is used in the following translators (at various stages of development):
1 Extending apertium-kaz
1.1 Stems and categories
To extend apertium-kaz with new words, we need to know their lemmas and their categories. Below we list the possible categories of words. We ignore the so-called closed-class words here, since they are very unlikely to appear among unrecognized words at this stage, and we purposefully simplify some of the categories of open-class words.
Category | Comment | Examples (from the apertium-kaz.kaz.lexc file) |
Nouns | | |
N1 | common nouns | алма:алма N1 ; ! “apple” |
 | | жылқы:жылқы N1 ; ! “horse” |
N5 | nouns which are loanwords from Russian (and therefore potentially with exceptions in phonology) | артист:артист N5 ; ! "" |
 | | баррель:баррель N5 ; ! "" |
N6 | Linking nouns like акт, субъект, эффект to N6 forces apertium-kaz to analyse both акт and акті as noun, nominative; both актты and актіні as noun, accusative, etc. The latter forms are the default: акті and актіні are generated for акт<n><nom> and акт<n><acc>, respectively, if apertium-kaz is used as a morphological generator. | |
N1-ABBR | abbreviated nouns | ДНҚ:ДНҚ%{а%} N1-ABBR ; ! "DNA" |
 | | млн:млн%{а%}%{з%} N1-ABBR ; ! "million" |
 | | млрд:млрд%{а%}%{с%} N1-ABBR ; ! "billion" |
 | | км:км%{э%}%{з%} N1-ABBR ; ! "km" |
Verbs | | |
V-TV | transitive verbs | |
V-IV | intransitive verbs | |
 | If the verb can take a direct object with -НЫ, then it is TV; otherwise it is IV. | |
Proper nouns | | |
NP-ANT-F | feminine anthroponyms | Сәмиға |
NP-ANT-M | masculine anthroponyms | Чыңғыз |
NP-COG-OB | family names ending with -ов or -ев | Мусаев |
NP-COG-IN | family names ending with -ин | Нуруллин |
NP-COG-M | family names not ending with -ов, -ев or -ин; masculine | Галицкий |
NP-COG-F | family names not ending with -ов, -ев or -ин; feminine | Толстая |
NP-COG-MF | family names not ending with -ов, -ев or -ин which can be both masculine and feminine | Гайдар |
NP-PAT-VICH | patronyms ending with -вич (and which can thus also take the -вна ending) | Васильевич:Василье NP-PAT-VICH ; ! "" |
NP-TOP | toponyms (river names should go here too) | Берлин |
NP-ORG | organization names | Қазпошта |
NP-ORG-LAT | organization names written in Latin characters | Microsoft |
NP-AL | proper names not belonging to one of the above NP-* classes | Протон-М |
Adjectives | | |
A1 | adjectives which can modify both nouns (жақсы адам) and verbs (жақсы оқиды) | |
A2 | all other adjectives | көктемгі |
Adverbs | | |
ADV | adverbs | әбден |
 | If you want to add an adverb, first consider whether the word is really an adjective that can be used like an adverb. If so, add it as an A1 adjective. | |
Figuring out the lemma of an unrecognized word should be straightforward. Except for verbs, where the lemmas in apertium-kaz are 2nd person singular imperative forms such as бар, кел, ал, etc. (i.e. not бару, келу, алу as in print dictionaries), the lemmas are what you would expect to see in print dictionaries of Kazakh.
Still, there are some things to keep in mind (we use the words “stem” and “lemma” interchangeably below):
Many stems exhibit a voicing alternation like п/б, к/г, қ/ғ. This is handled automatically by apertium-kaz.kaz.twol, so such stems must be added to apertium-kaz.kaz.lexc with the voiceless consonant (п, к, қ), e.g. тақ:тақ V-TV ;
Stems from Russian that end with one of the voiced consonants (б, г), such as геолог, should be entered as spelled, but should be put in the right category for foreign words (e.g., if a noun, then N5).
Words that have an inserted ‹ы› or ‹і› in some forms should get %{y%} in that spot on the right side, e.g. орын:ор%{y%}н N1 ;
Verb stems should not include the infinitive ending -у or -ю. It is best to take the part of the verb that precedes -GAн or -DI in those forms.
Infinitives ending in -ю should end in ‹й› instead, e.g. ‹сүю› should be entered as сүй.
Some verbs have a "hidden" ‹ы› or ‹і› under the ‹у›, for example ері, аршы, аңды, etc. These verb stems should be added with the ‹ы› or ‹і›.
Of course, verbs with ‹у› in the stem should keep the ‹у›, like жу, қу, жау, etc.
Do not add passive or cooperative forms of verb stems (e.g., ‹тартыл› is passive of ‹тарт›, and ‹тартыс› is cooperative).
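The guidelines above can be sketched as a pair of small helpers. This is only an illustration, not part of apertium-kaz: the function names lexc_entry and verb_lemma_from_infinitive are hypothetical, and the infinitive rule covers just the -у/-ю cases described above. It cannot recover a "hidden" ‹ы› or ‹і› (as in ері), so its output still needs to be checked by hand.

```python
def lexc_entry(lemma, category, right=None, gloss=None):
    """Format one lexc line of the form 'lemma:right CATEGORY ; ! "gloss"'.

    `right` is the right-hand side of the entry; it defaults to the lemma
    and only differs when the stem needs a symbol such as %{y%}.
    """
    right = lemma if right is None else right
    line = f"{lemma}:{right} {category} ;"
    if gloss is not None:
        line += f' ! "{gloss}"'
    return line


def verb_lemma_from_infinitive(infinitive):
    """Turn a dictionary infinitive into a candidate apertium-kaz verb lemma.

    Heuristic only: a final -ю becomes й, a final -у is dropped.
    A hidden ы/і under the у is NOT restored; check such verbs manually.
    """
    if infinitive.endswith("ю"):
        return infinitive[:-1] + "й"
    if infinitive.endswith("у"):
        return infinitive[:-1]
    return infinitive


print(lexc_entry("тақ", "V-TV"))
print(lexc_entry("орын", "N1", right="ор%{y%}н"))
print(lexc_entry(verb_lemma_from_infinitive("сүю"), "V-TV"))
```

The category passed in (here V-TV for сүй) is the caller's decision; the helpers do not try to guess transitivity or noun class.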
1.2 Lexicons
At the end of apertium-kaz.kaz.lexc, there are five lexicons:
Common
Hardcoded
Abbreviations
Punctuation
Proper
In each lexicon, entries are sorted alphabetically with the LC_ALL=kk_KZ.utf8 sort command.
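As a rough sketch of how that ordering could be reproduced from a script, the snippet below pipes entries through the system sort command with LC_ALL set to kk_KZ.utf8. The helper name sort_entries is hypothetical. It assumes a Unix-like system; if the kk_KZ.utf8 locale is not installed, sort typically falls back to byte order, so the result may differ from the order used in the repository.

```python
import os
import shutil
import subprocess


def sort_entries(entries, locale_name="kk_KZ.utf8"):
    """Sort lexc entries by running `sort` with LC_ALL set to `locale_name`."""
    if shutil.which("sort") is None:
        # Fallback (an assumption of this sketch): plain code-point order
        # when no `sort` binary is available on this system.
        return sorted(entries)
    env = dict(os.environ, LC_ALL=locale_name)
    result = subprocess.run(
        ["sort"],
        input="".join(entry + "\n" for entry in entries),
        capture_output=True,
        text=True,
        encoding="utf-8",
        env=env,
        check=True,
    )
    return result.stdout.splitlines()


entries = [
    "жылқы:жылқы N1 ;",
    "орын:ор%{y%}н N1 ;",
    "алма:алма N1 ;",
]
for line in sort_entries(entries):
    print(line)
```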
These five lexicons are where you have to put new words, after you have figured out their stems and categories following the guidelines above.
The Abbreviations and Punctuation lexicons should be self-explanatory.
Any stem linked to a lexicon whose name starts with NP should be placed into LEXICON Proper.
Any (temporary) entry which involves tags, e.g.
қыл%<v%>%<tv%>%<gna_perf%>:ғып # ; ! "same as қып"
belongs to the Hardcoded section.
All remaining stems go to LEXICON Common.
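The placement rules above can be summarised in a short sketch. The function choose_lexicon is hypothetical and not part of apertium-kaz. It also assumes that entries linked to a category ending in -ABBR belong in Abbreviations, which the text implies but does not state outright, and it does not try to detect punctuation entries. The NP-TOP entry in the usage lines is written in an assumed lemma:stem format.

```python
def choose_lexicon(entry):
    """Suggest which of the final lexicons a lexc entry belongs to."""
    # Drop the trailing comment (everything after "!") and the final ";".
    body = entry.split("!")[0].strip().rstrip(";").strip()
    # The last whitespace-separated token is the continuation lexicon.
    form, continuation = body.rsplit(None, 1)
    if "%<" in form:
        # Entries that spell out tags explicitly are hardcoded forms.
        return "Hardcoded"
    if continuation.startswith("NP"):
        return "Proper"
    if continuation.endswith("-ABBR"):
        # Assumption of this sketch, see the note above.
        return "Abbreviations"
    return "Common"


print(choose_lexicon('қыл%<v%>%<tv%>%<gna_perf%>:ғып # ; ! "same as қып"'))
print(choose_lexicon("Берлин:Берлин NP-TOP ;"))
print(choose_lexicon('алма:алма N1 ; ! "apple"'))
```

The tag check comes first because a hardcoded entry may continue to any lexicon (here #), so its continuation alone says nothing about where it should live.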