2 The shallow-transfer machine translation engine

This chapter briefly describes the structure of the shallow-transfer machine translation engine, which is largely based on that of two existing systems: interNOSTRUM [1] [5] [4] for Spanish–Catalan and Traductor Universia [6] [8] for Spanish–Portuguese, both developed by the Transducens group of the Universitat d’Alacant. It is a classical indirect translation system that uses a partial syntactic transfer strategy similar to the one used by some commercial MT systems for personal computers.

The design of the system makes it possible to produce MT systems that are fast (translating tens of thousands of words per second on ordinary desktop computers) and whose output is, in spite of the errors, reasonably intelligible and easily correctable. For related languages such as the ones involved in the project (Spanish, Galician, Catalan), a mechanical word-for-word translation (with a fixed equivalent) would produce errors that, in most cases, can be fixed with a fairly rudimentary analysis (a morphological analysis followed by a superficial, local and partial syntactic analysis) and with an appropriate treatment of lexical ambiguities (mainly due to homography). The design of our system follows this approach with very interesting results. The Apertium architecture uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging and finite-state-based chunking for structural transfer.

The translation engine consists of an 8-to-12 module assembly line, which is represented in Figure 1. To ease diagnosis and independent testing, modules communicate with each other using text streams. This way, the input and output of each module can be checked at any moment and, when an error in the translation process is detected, it is easy to test the output of each module separately to track down the origin of the error. At the same time, communication via text allows some of the modules to be used in isolation, independently from the rest of the MT system, for other natural-language processing tasks, and enables the construction of prototypes with modified or additional modules.
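The text-stream design can be illustrated with a minimal Python sketch. The stage names, the `^...$` delimiters and the one-entry lexicon below are hypothetical placeholders chosen for this example, not the actual Apertium modules or their stream format. The only point is that each stage maps text to text, so every intermediate output can be inspected and any stage can be run on its own.

```python
from typing import Callable, List, Tuple

# A stage is any function that maps a text stream to a text stream.
Stage = Callable[[str], str]


def toy_analyser(text: str) -> str:
    """Hypothetical stage: wrap each token in ^...$ delimiters."""
    return " ".join(f"^{tok}$" for tok in text.split())


def toy_transfer(text: str) -> str:
    """Hypothetical stage: replace tokens using a tiny fixed lexicon."""
    lexicon = {"^gato$": "^gat$"}  # illustrative entry only
    return " ".join(lexicon.get(tok, tok) for tok in text.split())


def toy_generator(text: str) -> str:
    """Hypothetical stage: strip the delimiters again."""
    return " ".join(tok.strip("^$") for tok in text.split())


def run_pipeline(text: str, stages: List[Tuple[str, Stage]]) -> List[Tuple[str, str]]:
    """Run the stages in order and keep every intermediate output."""
    trace = []
    for name, stage in stages:
        text = stage(text)
        trace.append((name, text))
    return trace


PIPELINE = [
    ("analyser", toy_analyser),
    ("transfer", toy_transfer),
    ("generator", toy_generator),
]

if __name__ == "__main__":
    # Print the output of every stage to locate where an error appears.
    for name, output in run_pipeline("el gato", PIPELINE):
        print(f"{name}: {output}")
```

Because the trace keeps the output of every stage, a wrong translation can be traced back to the first stage whose output is already incorrect, which is the diagnostic benefit described above.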

We decided to encode linguistic data files in XML-based formats (http://www.w3.org/XML/) because of XML's interoperability, its independence of the character set and the availability of many tools and libraries that make it easy to analyse data in this format. As stated in [7], XML is the emerging standard for data representation and exchange on the Internet. Technologies around XML include very powerful mechanisms for accessing and editing XML documents, which will probably have a significant impact on the development of tools for natural language processing and annotated corpora.
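As an illustration of how off-the-shelf XML tooling can read such data, the following sketch parses a small dictionary fragment with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`dictionary`, `entry`, `left`, `right`, `pos`) are invented for this example and are not the actual Apertium formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified dictionary fragment (not the real Apertium schema).
TOY_DICTIONARY = """
<dictionary>
  <entry pos="n"><left>gato</left><right>gat</right></entry>
  <entry pos="n"><left>año</left><right>any</right></entry>
</dictionary>
"""


def load_entries(xml_text: str) -> dict:
    """Map each left-side word to its (right-side word, part of speech)."""
    root = ET.fromstring(xml_text.strip())
    entries = {}
    for entry in root.iter("entry"):
        left = entry.findtext("left")
        right = entry.findtext("right")
        entries[left] = (right, entry.get("pos"))
    return entries


if __name__ == "__main__":
    for source, (target, pos) in load_entries(TOY_DICTIONARY).items():
        print(source, "->", target, f"({pos})")
```

The parser handles the non-ASCII entry without any special treatment, which is one practical consequence of the character-set independence mentioned above.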

Figure 1: The modules that make up the assembly line of the shallow-transfer machine translation system. Many monolingual packages in Apertium have both a statistical and a Constraint Grammar-based morphological disambiguator. The 'discontiguous multiword processing' module, also called apertium-separable, was introduced in 2018 and is optional. The current 'lexical selection module' was added in 2012. The number and type of structural transfer modules can vary from a single 'chunker' module to a three-stage 'chunker-interchunk-postchunk' structural transfer with several modules at each stage. There is also an alternative transfer module called apertium-recursive, which was developed in 2019.

The modules Apertium consists of are the following:

The four lexical processing modules (morphological analyser, lexical transfer module, morphological generator and post-generator) use a single compiler, based on a class of finite-state transducers [4] known as letter transducers [3]; its characteristics are described in Section 4.1.3.
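To give an intuition for what a letter transducer does, here is a toy Python sketch. It assumes the usual definition: a finite-state transducer whose transitions are labelled with pairs of symbols (input:output), either of which may be empty. The class, the naive left-to-right alignment and the tags (`<n>`, `<sg>`, `<pl>`, `<vblex>`) are simplifications made up for this example; this is not the Apertium compiler, which is described in Section 4.1.3.

```python
import re
from itertools import zip_longest

EPS = ""  # the empty symbol


def symbols(form: str) -> list:
    """Split a form into symbols; a tag such as <n> counts as one symbol."""
    return re.findall(r"<[^>]+>|.", form)


class LetterTransducer:
    def __init__(self):
        # (state, input symbol) -> set of (output symbol, next state)
        self.transitions = {}
        self.finals = set()
        self.next_state = 1  # state 0 is the initial state

    def add_pair(self, surface: str, lexical: list) -> None:
        """Add one surface:lexical pair, aligned symbol by symbol."""
        state = 0
        for a, b in zip_longest(surface, lexical, fillvalue=EPS):
            arcs = self.transitions.setdefault((state, a), set())
            # Reuse an existing arc with the same a:b label if there is one.
            target = next((t for out, t in arcs if out == b), None)
            if target is None:
                target = self.next_state
                self.next_state += 1
                arcs.add((b, target))
            state = target
        self.finals.add(state)

    def analyse(self, surface: str) -> set:
        """Return the outputs of all paths that consume exactly `surface`."""
        results = set()
        stack = [(0, 0, [])]  # (state, input position, output so far)
        while stack:
            state, pos, out = stack.pop()
            if pos == len(surface) and state in self.finals:
                results.add("".join(out))
            # Arcs that consume no input symbol.
            for b, target in self.transitions.get((state, EPS), ()):
                stack.append((target, pos, out + [b]))
            # Arcs that consume the next input symbol.
            if pos < len(surface):
                for b, target in self.transitions.get((state, surface[pos]), ()):
                    stack.append((target, pos + 1, out + [b]))
        return results


demo = LetterTransducer()
for surface, lexical in [
    ("gato", "gato<n><sg>"),
    ("gatos", "gato<n><pl>"),
    ("casa", "casa<n>"),
    ("casa", "casar<vblex>"),
]:
    demo.add_pair(surface, symbols(lexical))

if __name__ == "__main__":
    print(sorted(demo.analyse("casa")))
```

In this sketch the homograph "casa" yields two analyses, which is the kind of lexical ambiguity that a later disambiguation stage has to resolve.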

2 For more information about the treatment of multiwords, please refer to the section on multiwords.