
3 Format specification of the data stream between modules

3.1 Introduction

The format of the data that circulates between the engine’s modules must be specified so that document processing is more effective and transparent. The proposed system design (see The shallow-transfer machine translation engine) requires three different data stream types, as shown in Figure 2.

The stream format is text-based, which, among other things, makes system errors easier to diagnose: the stream is easy to edit in order to reproduce the phenomenon under test, and to change in order to see the result. Text streams also make it possible to test the output of each module independently and to build prototypes quickly for testing the system’s overall performance, the validity of linguistic data, etc.

Figure 2: The different data stream types in the machine translation system. See the text for its description.

The data stream types are:

Next we describe the characteristics of the data stream used between the modules of the translator, that is, the second and third stream types. In general terms, it is a plain text format marked up with characters that have a special meaning. This format is intended for processing on servers that translate large volumes of text.

Some of the formats that the engine can process may contain extensive blocks of binary information; RTF, for instance, may include bitmap images. To process this type of document efficiently, we designed a way to extract this information and restore it after translation; see Format processing for a complete description.

3.2 Data stream without format

Data stream without format is output by the de-formatter and by the generator, and is used as input by the morphological analyser, the post-generator and the re-formatter.

Subsection 3.2.1 describes how superblanks and extensive superblanks are delimited. As an example we will use the HTML document in Figure 3.

<html>

  <head>

    <title>Title</title>

  </head>

  <body>

    <p>Divided

       sentence</p>

  </body>

</html>

Figure 3: Example of HTML document

The structural elements that this data stream type must include are the following:

3.2.1 Stream format

This format is based on the one used in the machine translation systems interNOSTRUM [1] [5] [4] and Traductor Universia [6] [8].

In this stream type, the characters [ and ] are used to indicate superblanks, as shown in the following example:

[superblank content]

In the case of extensive superblanks, the file name is specified using the at sign @:

[@file name]

The text is outside the superblank marks.

Artificial sentence endings are expressed by a full stop and an empty superblank right after it.

.[]
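As a rough illustration of the delimiters described above, the following Python sketch splits a stream without format into text, superblank and extensive-superblank pieces. The function name `split_superblanks` and the piece labels are hypothetical and not part of the engine. The sketch assumes that a backslash protects the character that follows it and that superblanks do not nest.

```python
def split_superblanks(stream):
    """Split a stream into (kind, content) pieces.

    kind is "text", "blank" (superblank) or "extensive" (superblank whose
    content is a file name introduced by @). Labels are illustrative only.
    """
    pieces = []
    buf = []
    in_blank = False
    i = 0
    while i < len(stream):
        ch = stream[i]
        if ch == "\\" and i + 1 < len(stream):
            # Protected character: keep the backslash and the character as-is.
            buf.append(stream[i:i + 2])
            i += 2
            continue
        if ch == "[" and not in_blank:
            if buf:
                pieces.append(("text", "".join(buf)))
            buf = []
            in_blank = True
        elif ch == "]" and in_blank:
            content = "".join(buf)
            if content.startswith("@"):
                pieces.append(("extensive", content[1:]))
            else:
                pieces.append(("blank", content))
            buf = []
            in_blank = False
        else:
            buf.append(ch)
        i += 1
    if in_blank:
        raise ValueError("unterminated superblank")
    if buf:
        pieces.append(("text", "".join(buf)))
    return pieces
```

In this sketch, an artificial sentence ending such as `.[]` shows up as a text piece ending in a full stop followed by a superblank with empty content.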

The following table shows the protected characters:

Name                    Character  Protected form  Meaning
At                      @          \@              Extensive superblank (external file)
Slash                   /          \/              Divider of meaning
Backslash               \          \\              Protection character
Caret                   ^          \^              Beginning of lexical form (LF)
Opening square bracket  [          \[              Beginning of blank
Closing square bracket  ]          \]              End of blank
Dollar                  $          \$              End of lexical form (LF)
Greater than            >          \>              End of morphological symbol
Less than               <          \<              Beginning of morphological symbol
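The protection scheme in the table above can be sketched in Python as follows. The names `PROTECTED`, `protect` and `unprotect` are hypothetical helpers introduced for illustration; the engine's own modules may handle protection differently. The sketch assumes that every protected character is escaped with a single backslash, as the table suggests.

```python
# Characters listed as protected in the table (backslash included).
PROTECTED = "\\@/^$[]<>"


def protect(text):
    """Prefix every protected character in plain text with a backslash."""
    return "".join("\\" + ch if ch in PROTECTED else ch for ch in text)


def unprotect(text):
    """Remove one level of backslash protection."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            # Drop the backslash and keep the protected character.
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Applying `unprotect` to the result of `protect` returns the original text, which is the property a de-formatter and re-formatter pair would presumably rely on.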

Figure 4 shows the document in Figure 3 after encapsulation.

[<html>

  <head>

    <title>]Title.[][</title>

  </head>

  <body>

    <p>]Divided[

       ]sentence.[][</p>

  </body>

</html>]

Figure 4: The document in Figure 3 with format encapsulated using square brackets

3.3 Segmented data stream

Segmented data stream is the stream that circulates between the modules that handle linguistic information in the translation engine. In this stream, words are delimited and labelled. There are two types of segmented stream:

Furthermore, besides the information already marked in the data stream without format, the new stream has to enable marking of the following information:

The symbols ^ for word beginning and $ for word end are used to delimit words, as shown in this example:

^word$

The symbol / separates the surface form from the lexical forms that follow it. This separator only makes sense in the ambiguous segmented stream, since the unambiguous stream contains only the lexical form. It is used as follows:

^surface form/lexical form 1/...$
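To make the delimiters concrete, here is a minimal Python sketch that takes one word of the ambiguous segmented stream and separates the surface form from its lexical forms. The function names `split_unescaped` and `parse_lexical_unit` are hypothetical. The sketch assumes that a backslash-protected slash is not a separator and that the input is a single, complete `^...$` unit.

```python
def split_unescaped(text, sep):
    """Split text on sep, ignoring occurrences protected by a backslash."""
    parts = []
    buf = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            # Keep the protected pair untouched.
            buf.append(text[i:i + 2])
            i += 2
        elif text[i] == sep:
            parts.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(text[i])
            i += 1
    parts.append("".join(buf))
    return parts


def parse_lexical_unit(unit):
    """Return (surface_form, [lexical_form, ...]) for one ^...$ unit."""
    if len(unit) < 2 or not (unit.startswith("^") and unit.endswith("$")):
        raise ValueError("expected a unit delimited by ^ and $")
    surface, *lexical_forms = split_unescaped(unit[1:-1], "/")
    return surface, lexical_forms
```

For a unit of the unambiguous stream, which carries only the lexical form, the first element returned by this sketch would be that lexical form and the list would be empty.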

Lexical forms can include symbols (generally located at the end), as shown in the example of Figure 5.

[<html>

   <head>

     <title>]^Title/Title<n><m><sg>$^./.<sent>$[][</title>

   </head>

   <body>

     <p>]^Divided/Divide<vblex><pp>/Divided<vblex><past>$[

        ]^sentence/sentence<n><sg>/sentence<vblex><inf>$^./.<sent>$[][</p>

  </body>

</html>]

Figure 5: Example of segmented stream with format encapsulated in non-XML format, corresponding to the HTML document in Figure 3.
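The lexical forms in Figure 5 consist of a lemma followed by symbols in angle brackets. A minimal Python sketch for separating the two might look like the following. The name `split_lexical_form` is hypothetical, and the sketch assumes that all symbols are located at the end of the lexical form and that the lemma contains no protected `\<`; forms that violate either assumption would need a more careful parser.

```python
import re

# One morphological symbol: text between < and > with no nested brackets.
SYMBOL = re.compile(r"<([^<>]+)>")


def split_lexical_form(lexical_form):
    """Return (lemma, [symbol, ...]) for a form such as Divide<vblex><pp>."""
    first = lexical_form.find("<")
    if first == -1:
        # No symbols at all: the whole form is the lemma.
        return lexical_form, []
    lemma = lexical_form[:first]
    symbols = SYMBOL.findall(lexical_form[first:])
    return lemma, symbols
```

With the data from Figure 5, `Divide<vblex><pp>` would yield the lemma `Divide` and the symbols `vblex` and `pp`.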