This documentation describes the Apertium platform, one of the open-source machine translation systems which originated within the project “Open-Source Machine Translation for the Languages of Spain” (“Traducción automática de código abierto para las lenguas del estado Español”). It is a shallow-transfer machine translation system, initially designed for translation between related language pairs, although some of its components have also been used in the deep-transfer architecture (Matxin) that has been developed in the same project for the pair Spanish-Basque. Apertium can at present translate between the pairs Spanish-Galician, Spanish-Catalan (the name Catalan here also covers the Valencian dialectal variant of this language), Catalan-Occitan and Catalan-French, and can be used to build translators between other related language pairs, such as Danish-Swedish, Czech-Slovak, etc. 50 pairs have been released and are considered to be stable. They are listed on the wiki of the project and are showcased on apertium.org. Even more translators – in the beta stage of development – can be found on beta.apertium.org.
Machine translation systems currently available for the pairs es–ca and es–gl are mostly commercial or use proprietary technologies, which makes them very hard to adapt to new uses; furthermore, they use different technologies across language pairs, which makes it very difficult to integrate them into a single multilingual content management system.
One of the main novelties of the architecture described here is that it has been released under open-source licenses (in most cases, GNU GPL; some data still have a Creative Commons license) and is distributed free of charge. This means that anyone with the necessary computational and linguistic skills will be able to adapt or enhance the platform or the language-pair data to create a new machine translation system, even for other pairs of related languages. The licenses chosen make these improvements immediately available to everyone. We therefore expect that the introduction of this open-source machine translation architecture will solve some of the problems mentioned above (different technologies for different pairs, closed-source architectures that are hard to adapt to new uses, etc.) and promote the exchange of existing linguistic data through the use of the XML-based formats defined in this documentation. We also think that it will help shift the current business model from a license-centered one to a services-centered one.
It is worth mentioning that “Open-Source Machine Translation for the Languages of Spain” was the first large open-source machine translation project funded by the central Spanish Government, although the adoption of open-source software by the Spanish governments is not new.
This documentation describes in detail the characteristics of the Apertium platform, and is organized as follows:
Chapter 2: general description of the shallow-transfer machine translation system and of the modules that make it up.
Chapter 3: description of the format of the data stream that circulates from one module to the next one.
Chapter 4: specification of the modules of the system. For each module there is a description of the program and its characteristics, the format of the data that the module uses, and the compilers used for it. This chapter is divided into the following sections:
Section 4.1: Lexical processing modules, where the morphological analyser, the lexical transfer module, the morphological generator and the post-generator are described (Section 4.1.1), along with the format of the dictionaries used by these modules (Section 4.1.2) and their compilers (Section 4.1.3).
Section 4.2: Part-of-speech Tagger, which describes the tagger (Section 4.2.1) and the format of the linguistic data used by the tagger (Section 4.2.2).
Section 4.3: Pre-transfer module, which describes the module that runs before the structural transfer module to perform some operations on multiword units.
Section 4.5: Structural transfer module, where there is a description of the program (Section 4.5.2) and of the format of the structural transfer rules (Section 4.5.4).
Section 4.6: De-formatter and Re-formatter, which describes these modules (Section 4.6.1), the rules for format processing (Section 4.6.2) and how these modules are generated (Section 4.6.3).
Chapter 5: description of how to install the system and run the translator.
Chapter 6: here you will find an explanation of how to modify the linguistic data used by the translator, that is, the dictionaries, the part-of-speech disambiguation data and the structural transfer rules created in this project for Spanish, Catalan, Galician and many other languages. Furthermore, it contains a brief description of the characteristics of the available data for these three language pairs.
The files to which this documentation refers can be found on, and downloaded from, the project page on GitHub: https://github.com/apertium. From this page you can download the packages needed for installation, as well as view the individual files in the SVN (main) and CVS (residual) repositories of the project. The machine translation systems for the different language pairs can also be tested on the Internet at https://apertium.org/ (released versions) or https://beta.apertium.org (nightly versions). Besides the translation modes proper, the latter website also allows individual morphological analysers or generators to be tested.
The Spanish Ministry of Industry, Commerce and Tourism has funded the development of this toolbox through the projects “Open-Source Machine Translation for the Languages of Spain”, code FIT-340101-2004-3, and its extension FIT-340001-2005-2, and “EurOpenTrad: Open-Source Advanced Machine Translation for the European Integration of the Languages of Spain”, code FIT-350101-2006-5, all of them belonging to the PROFIT program.
Workers and scholars from other machine translation projects at the Universitat d’Alacant: Míriam Antunes Scalco, Carme Armentano i Oller, Raül Canals i Marote, Alicia Garrido Alenda, Patrícia Gilabert i Zarco, Maribel Guardiola i Savall, Javier Herrero Vicente, Amaia Iturraspe Bellver, Sandra Montserrat i Buendia, Hermínia Pastor Pina, Antonio Pertusa Ibáñez, Francisco Javier Ramos Salas, Marcial Samper Asensio and Miguel Sánchez Molina.
The companies and institutions that have funded these other machine translation projects: Spanish Ministry of Science and Technology, Caja de Ahorros del Mediterráneo, Universitat d’Alacant and Portal Universia, S.A.
Iñaki Alegria, from the Ixa group of the Euskal Herriko Unibertsitatea (University of the Basque Country), for his close reading of previous versions of this document.
Google, who, through the Google Summer of Code and Google Code-In programmes, funded the development of several new modules.