A Fast Decoder Using Less Memory

Hien Vo Minh
Faculty of Information Technology
University of Science
Ho Chi Minh City, Vietnam
[email protected]

Dien Dinh
Faculty of Information Technology
University of Science
Ho Chi Minh City, Vietnam
[email protected]

Nhung Nguyen Thi Hong
Faculty of Information Technology
University of Science
Ho Chi Minh City, Vietnam
[email protected]

Abstract — Statistical Machine Translation (SMT) uses large text corpora and complex calculation operations in its translation process, which makes the method demand considerable system resources for fast translation. In this paper, we introduce a decoding approach for SMT that uses less memory yet translates faster, making it more suitable for mobile applications and embedded systems. In our approach, the SMT models are stored in tree structures to speed up the loading process, and the decoding algorithm is optimized to reduce operations. We apply our approach to English-Vietnamese and Vietnamese-English SMT systems. When translating 20,000 English sentences with an average length of 7.45 words, we achieve a 37.8 BLEU score at an average speed of 0.052 s per sentence. For the Vietnamese-English system, translating 20,000 Vietnamese sentences with an average length of 8.42 words yields a BLEU score of 34.63 at an average speed of 0.091 s per sentence.

Keywords: statistical machine translation, decoding algorithm, improved decoding, mobile application, embedded system.

I. INTRODUCTION

Statistical Machine Translation (SMT), especially phrase-based SMT, has recently shown great advantages over other Machine Translation (MT) approaches in translation quality and in the ease with which it can be adapted to new language pairs and new domains [10]. Owing to these advantages, many translation systems are now available on the internet, such as Google Translate and Bing Translator. However, applying SMT to mobile devices or embedded devices is more difficult.

SMT systems use large amounts of data to train their statistical models, and the resulting models can easily grow to several gigabytes when loaded into memory [10]. Mobile and embedded devices have very limited dynamic memory, which makes it difficult to build an SMT system on them. Moreover, the CPUs on these devices are weak and have no numerical co-processors [10], which are critical for text processing and for calculating the various probabilities in an SMT system. Motivated by these restrictions, we propose a new approach for decoding in an SMT system, which is more suitable for mobile applications and embedded devices.

In an SMT system, the basic models for translation are the language model and the translation model. In detail, the language model is based on the n-gram model, and the translation model is based on the phrase table and the reordering table. In our experiments, we train on a corpus of over 20,000 English-Vietnamese pairs, and the total size of the three tables is over 28 MB. The size of the three tables may grow to gigabytes if we train with 1,000,000 pairs, while the physical memory of mobile devices is just 256 MB or 512 MB. Therefore, it is impractical to store the whole language model and translation model in memory.

In our research, we aim to reduce the memory loaded while keeping the speed of loading and decoding acceptable. To tackle these problems, instead of loading the n-gram model and phrase table into memory, we use a special tree-based structure to store these models on hard disk. In this structure, each tree node contains information for decoding, which lets us load decoding information efficiently during the translation process. Moreover, our system manages words by giving each word an index number and uses a special structure to store the word mapping on hard disk while keeping it available for fast loading.

In the decoding algorithm, we optimize the step of choosing translation options of a phrase, which reduces the number of hypotheses added to hypothesis stacks and significantly speeds up the decoding process.

To address the cost of calculation operations, we map scores from floats to integers and exclude the logarithm operation from our decoder. We also introduce some techniques to reduce operations during the decoding process and increase the speed of decoding.

The new decoding algorithm and calculation operations may reduce the translation quality of the system. However, the experimental results show a competitive performance with the original system: the BLEU score of our system is 4% smaller than that of the MOSES system [5], while our system is 76.26% faster and uses much less memory.

The rest of the paper is organized as follows. Section II presents related work in statistical machine translation. Section III describes our approach in more detail, including how we manage words, the structures of the n-gram model and phrase table, the decoding algorithm, and other techniques. Our experiments are presented in Section IV, and we conclude in Section V.

II. RELATED WORK

A. Phrase-based statistical machine translation

In SMT, there are many candidate translations when we translate a source sentence $f_1^J = f_1 \cdots f_J$ into a target sentence $e_1^I = e_1 \cdots e_I$. Therefore, we choose the best candidate, the one with the highest probability value [1].


$e^* = \arg\max_{e} p(e \mid f_1^J) = \arg\max_{e} p(f_1^J \mid e)\,p(e)$ (1)

The probability of translating a sentence is based on two models: the translation model $p(f_1^J \mid e)$ and the language model $p(e)$. The translation model determines how likely the target sentence is a translation of the source sentence, and the language model determines how fluent the target sentence is.

The original SMT system [1] performs word-to-word translation. A newer system described in [2] is based on phrase-to-phrase translation, which translates with more context from the source sentence. In this system, phrases are stored in a phrase table, the core information of the translation model. Figure 1 shows some English-Vietnamese phrase translation pairs automatically extracted from the KMDC1 corpus.

Figure 1. Entries of the phrase table

Using the log-linear model [7], we decide the best translation by choosing the candidate with the highest score under the logarithm form of the probability, formula (2).

$p(e \mid f_1^J) = \exp\Big(\sum_{m=1}^{M} \lambda_m h_m(e, f_1^J)\Big)$ (2)

Here $h_m$ are the feature functions that estimate feature values of $(e, f)$ and $\lambda_m$ are their weights. In our system, we use three feature functions, coming from the language model, the translation model, and the reordering model.

Formula (3), an expanded form of formula (2), is the full equation for calculating the score of a translation candidate, including the scores of the n-gram model, the translation model, and the reordering model.

$\text{score}(e \mid f) = \lambda_{\phi} \sum_{i=1}^{I} \log \phi(f_i, e_i) + \lambda_{d} \sum_{i=1}^{I} d(\text{start}_i - \text{end}_{i-1} - 1) + \lambda_{LM} \sum_{i=1}^{|e|} \log p_{LM}(e_i \mid e_1 \ldots e_{i-1})$ (3)
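As a concrete reading of formula (3), the following Python sketch scores one candidate segmentation. It is illustrative only: the phrase-pair representation and the linear distance penalty d(x) = -|x| are our assumptions, not the paper's implementation.

import math

def candidate_score(phrase_pairs, lm_probs, w_tm=1.0, w_d=1.0, w_lm=1.0):
    """Score one translation candidate following formula (3).

    phrase_pairs -- list of (phi, start, end): phrase translation probability
                    and the 1-based source span [start, end] it covers
    lm_probs     -- p_LM(e_i | e_1 ... e_{i-1}) for every target word
    """
    tm = sum(math.log(phi) for phi, _, _ in phrase_pairs)
    reorder, prev_end = 0.0, 0
    for _, start, end in phrase_pairs:
        reorder += -abs(start - prev_end - 1)   # d(start_i - end_{i-1} - 1)
        prev_end = end
    lm = sum(math.log(p) for p in lm_probs)
    return w_tm * tm + w_d * reorder + w_lm * lm

For a monotone translation whose phrases cover spans (1, 2) and (3, 5), every distance term is zero, and the score reduces to the weighted translation and language model sums.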

The translation model probability $\phi(f_i, e_i)$ is the probability that the target phrase is a translation of the source phrase. Additionally, using bidirectional translation probabilities, lexical weighting, and the phrase penalty [7], we have four more scores added to this probability:

• Inverse phrase translation probability $\phi(f_i \mid e_i)$
• Inverse lexical weighting $\text{lex}(f_i \mid e_i)$
• Direct phrase translation probability $\phi(e_i \mid f_i)$
• Direct lexical weighting $\text{lex}(e_i \mid f_i)$
• Phrase penalty (always $\exp(1) = 2.718$).

Hence, the translation model score is calculated by formula (4).

$\text{score}(e \mid f)_{TM} = \sum_{i=1}^{I} \Big( \log \phi(f_i \mid e_i) + \log \text{lex}(f_i \mid e_i) + \log \phi(e_i \mid f_i) + \log \text{lex}(e_i \mid f_i) + \exp(1) \Big)$ (4)

The reordering score is the distance-based reordering described in [4].

The language model score is calculated from the smoothed score and the back-off score [4], as shown in formula (5).

$p_{LM}(w_n \mid w_1, \ldots, w_{n-1}) = \begin{cases} p_n(w_n \mid w_1, \ldots, w_{n-1}) & \text{if } (w_1 \ldots w_n) \text{ exists in the model} \\ \text{backoff}(w_1, \ldots, w_{n-1}) \cdot p_{LM}(w_n \mid w_2, \ldots, w_{n-1}) & \text{otherwise} \end{cases}$ (5)
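To make the recursion in formula (5) concrete, here is a minimal Python sketch of the back-off lookup. The dictionaries smooth and backoff, keyed by tuples of word indices and holding log-domain values, are hypothetical stand-ins for the stored model; in log space the back-off multiplication becomes an addition.

def lm_score(words, smooth, backoff):
    """Back-off language model score of an n-gram, per formula (5).

    words   -- tuple of word indices (w1, ..., wn)
    smooth  -- dict: n-gram tuple -> smoothed log probability
    backoff -- dict: context tuple -> back-off weight (log domain)
    """
    if len(words) == 0:
        return 0.0
    if words in smooth:                  # n-gram exists: use the smoothed score
        return smooth[words]
    if len(words) == 1:                  # unseen unigram: assumed floor score
        return -99.0
    context = words[:-1]                 # back off: weight of the context plus
    return backoff.get(context, 0.0) + lm_score(words[1:], smooth, backoff)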

A decoder searches for the best translation using the beam search algorithm [3]. At each step of decoding, the decoder chooses a phrase, collects all translation options of this phrase to create hypotheses, and adds these hypotheses to a hypothesis stack. Each hypothesis is scored by formula (3); weak hypotheses with low scores are eliminated from the hypothesis stack.

A hypothesis, created during the decoding process, is a translated part of a source sentence. A hypothesis stack is a collection of hypotheses. In the original decoding system, there are a 1-word hypothesis stack, a 2-word hypothesis stack, and so on. Figure 2 illustrates these stacks.

Figure 2. Hypothesis and Hypothesis stacks in SMT

1 KMDC is an English-Vietnamese corpus containing 20,000 pairs of sentences. The corpus is owned by the VCL group ([email protected]).


Hypotheses in the 2-word stack are created by extending hypotheses in the 1-word stack with a 1-word phrase, and so on. The decoder takes the hypothesis covering the full sentence (stored in the last hypothesis stack) and traces back to obtain the full target sentence.

Each hypothesis stack limits the number of hypotheses stored in it. Whenever the stack is full, the decoder prunes it back to its size limit, which helps the decoding process eliminate bad paths during the search.

B. SMT systems for mobile devices

As described in the introduction, applying SMT to embedded systems or mobile devices is more difficult. Several systems have introduced techniques to address this problem [10][11][12][14][15].

FOLSOM [11] proposed a novel framework for phrase-based statistical machine translation using weighted finite-state transducers (WFSTs) that was significantly faster than existing frameworks while also being memory-efficient. The authors represented the entire translation model with a single, statically optimized WFST and described a new search algorithm that conveniently and efficiently combines multiple knowledge sources during decoding. The approach is particularly suitable for converged real-time speech translation on scalable computing devices. Their system translates more than 3,000 words per second with a 39.92 BLEU score on a test set of 377 sentences.

PanDoRA [10] used a compact data structure to store the n-gram model and translation model, so the system did not load the whole n-gram model and translation model into memory. It used integerized computation, avoiding floating-point numbers, so that the system could work on PDAs, which have no numerical co-processors.

The decoder of the PanDoRA system used both monotone decoding and reordering decoding. The advantage of monotone decoding is its speed, but it is only suitable for language pairs with very similar word orders, like Spanish-English, and works poorly for language pairs with very different word orders, like Japanese-English. For such language pairs, they used ITG reordering decoding, which employs the Stochastic Inversion Transduction Grammar [9]. Their Arabic-English system achieves a 47.06 BLEU score.

Speechalator [14] was a two-way speech-to-speech translation system running in near real-time on a consumer handheld computer. It could translate between English and Arabic in the domain of medical interviews. Their system (running in untethered mode) took around 2-3 seconds to translate a typical utterance, measured from when the speaker stops speaking to when the system starts speaking the translation.

The system described in [15] is a client-server speech translation system with mobile wireless clients. It performs speech translation between English and Japanese for travel conversation. Speech recognition accuracy was measured using 1,800 utterances by 10 Japanese male speakers and 10 English speakers; the word accuracy was 97.5% for the Japanese speakers and 92.0% for the English speakers.

picoTrans [12] presented a novel user interface that integrates two popular approaches to language translation for travelers, allowing multimodal communication between the parties involved: the picture book, in which the user simply points to picture icons representing what they want to say, and the statistical machine translation (SMT) system, which can translate arbitrary word sequences. Their system could generate semantically equivalent sentences for 74% of the sentences in their evaluation data, and 66% of these sentences were correct.

Our approach inherits the idea of compact data structure to store n-gram model and translation model, but we use a different structure which is more suitable for language model and translation model [10].

III. OUR APPROACH

A. Word management

In our system, we manage words by mapping each one to an integer index. For example, with 20,000 words we map them to integers ranging from 1 to 20,000. In the word-mapping stage, it is necessary to store a sorted list of word texts.

If the average length of a word is 5 characters, the word collection requires 20,000 * 5 bytes = 100,000 bytes (100 KB) of a device's memory, so in total we need 200,000 bytes (200 KB) to store both the source and target word collections.

In our approach, we reduce the amount of memory by storing only 1/16 of the words: we keep one word in memory and skip the next 15. As a result, only 200,000 bytes / 16 = 12,500 bytes (about 12.5 KB) need to stay in memory, which is much more acceptable. Our proposed structure is shown in Figure 3.

Figure 3. Words are stored in memory and hard disk.


Figure 4. Pseudocode of word index searching:

INPUT:  word_text, word_array        // array of every 16th word
OUTPUT: word_index

if word_text ∈ word_array then
    word_index = word_array_index * 16
else
    word_array2 = LoadArray(nearest_word_index)   // the 15 skipped words
    if word_text ∈ word_array2 then
        word_index = nearest_word_index * 16 + word_array2_index
    else
        word_index = 0

Now the problem is how to determine the index of a given word. Assume we store 1/16 of the words in an array; given a word text, we search this array for its index (using binary search for speed). There are two cases:

• First case: the word exists in the array, and its word index is its position in the array multiplied by 16.

• Second case: the word does not exist in the array, so we take the array position nearest to this word. Based on the nearest word's offset, we access the hard disk to load the 15 skipped words into a second array. Searching this second array gives the word's position in it, and the word index is the nearest word's position multiplied by 16 plus the word's position in the second array (Figure 4 shows the pseudocode of searching a word).

For each word kept in memory, we store the word text and the offset of the 15 skipped words on hard disk. With 40,000 words (source and target), the system needs (20,000 / 16) * (5 + 4) bytes * 2 = 22,500 bytes (about 22.5 KB) of memory, which is 88.75% smaller than the 200 KB needed to store all words in memory. This is a better way to store words with less memory while keeping the speed of loading words acceptable.
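The following Python sketch illustrates the two-level lookup under this scheme. The helper load_skipped_words, standing in for the disk read at a stored offset, and the 1-based position within a skipped block are our assumptions.

import bisect

def find_word_index(word, sampled_words, sampled_offsets, load_skipped_words):
    """Look up a word's integer index while keeping only every 16th word
    (sampled_words, sorted) in memory; sampled_offsets holds the disk offset
    of the 15 words skipped after each sample."""
    pos = bisect.bisect_right(sampled_words, word) - 1
    if pos < 0:
        return 0                                   # sorts before the first sample
    if sampled_words[pos] == word:                 # case 1: found among samples
        return pos * 16
    # case 2: load the 15 skipped words following the nearest sample
    skipped = load_skipped_words(sampled_offsets[pos])
    if word in skipped:
        return pos * 16 + skipped.index(word) + 1  # 1-based slot in the block
    return 0                                       # unknown word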

B. N-gram model

The n-gram model is an important part of an SMT system; it influences the coherence of the target sentence. In our system, we use a trigram model, in which the longest n-gram contains 3 words. Instead of loading it into memory, we store the n-gram model on hard disk using a tree-based structure.

As described in section A, each word is mapped to an integer, so all unigrams have equal size and all tree nodes in our tree-based n-gram model are equal in size. If we store all unigrams in order of word index, we can compute the offset of a word's unigram directly from its index, and the system can load the unigram information from this offset.

A tree node contains two probability values: the smoothed weight of the n-gram and its back-off weight. The other information is the number of children of the tree node and the offset of its first child node. Figure 5 shows an example of a tree node and its children.

Figure 5. A tree node of n-gram model and its children.

Using this structure, we access the n-gram node of a three-word phrase in the following three steps:

• Step 1: the system loads the unigram node based on the word index of the first word.

• Step 2: once the unigram node is loaded, the system loads all its children; searching for the second word in the children collection yields the bigram node.

• Step 3: doing the same for the third word finally yields the trigram node of this 3-word phrase.
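A minimal sketch of this three-step traversal, assuming fixed-size nodes serialized with Python's struct module. The field widths, their order, and the binary search over a word-id-sorted children block are our assumptions about the layout, not the paper's exact binary format.

import struct

# assumed node layout: word id (uint32), smoothed and back-off scores
# (uint16 each, integer-mapped as in section III-E), number of children
# (uint32), and the offset of the first child (uint64)
NODE = struct.Struct("<IHHIQ")

def read_node(f, offset):
    f.seek(offset)
    return NODE.unpack(f.read(NODE.size))   # (word_id, smooth, backoff, n, child_off)

def find_child(f, node, word_id):
    """Binary-search a node's contiguous, word-id-sorted children block."""
    _, _, _, n_children, child_off = node
    lo, hi = 0, n_children - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        child = read_node(f, child_off + mid * NODE.size)
        if child[0] == word_id:
            return child
        if child[0] < word_id:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

def find_trigram_node(f, w1, w2, w3, unigram_base):
    """Steps 1-3: unigram node by direct offset, then search children twice."""
    node = read_node(f, unigram_base + w1 * NODE.size)   # step 1
    for w in (w2, w3):                                   # steps 2 and 3
        node = find_child(f, node, w)
        if node is None:
            return None                                  # n-gram not in model
    return node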

Figure 6 demonstrates how the tree-based n-gram model is stored.

Figure 6. List of tree nodes, offsets, and their children

C. Phrase table

A phrase table is another important part of an SMT system. Each entry contains three fields: a source phrase, its translation in the target language (the target phrase), and the probability that this target phrase is a translation of the source phrase (as shown in Figure 1).

To store the phrase table, we also use the tree-based structure. Since a phrase table is more complex than an n-gram model, its structure is also more complex. In our approach, we split the phrase table into three parts: one for searching, one for the decoding process, and one for retrieving the target translation.

We divide the phrase table into three parts because of its large size. The phrase table of an SMT system may contain millions of phrases, which means millions of entries are stored in it, but not all of this information is necessary during the decoding process.


Figure 7. Three steps of processing a phrase

A given phrase is processed through three steps in the decoder (as shown in Figure 7):

• Step 1: the system searches for this phrase to get its information.

• Step 2: the system uses the information of this phrase, especially the probability, to decode.

• Step 3: if this phrase appears in the best translation candidate, its information is picked out to compose the target sentence.

Based on these three steps, we split the phrase table into three corresponding parts, one per step. The first part, used for searching, contains only source phrases.

The structure of the first part is similar to the tree-based structure of the n-gram model, but the information in each node is not the probability. Each tree node contains the number of translation options and the offset of the first translation option of this phrase, which is stored in the second part of phrase table. Figure 8 demonstrates the first part of phrase table.

In summary, each tree node in the first part of the phrase table contains five values:

• Word ID (index of the source word of this node)
• Number of children
• Offset of the first child
• Number of translation options
• Offset of the first translation option.

Figure 8. First part of phrase table
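Since these five values have fixed widths, a first-part node can be a fixed-size binary record. The following packing with Python's struct module is a sketch; the field widths are illustrative assumptions, not the paper's format.

import struct

# assumed layout of a first-part tree node: source word id, number of
# children, offset of the first child, number of translation options, and
# the offset of the first translation option in the second part
PHRASE_NODE = struct.Struct("<IIQIQ")

def unpack_phrase_node(buf):
    word_id, n_children, child_off, n_options, option_off = PHRASE_NODE.unpack(buf)
    return {"word_id": word_id,
            "children": (n_children, child_off),
            "options": (n_options, option_off)}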

The second part of the phrase table stores all translation options of all phrases. This part does not need a tree structure, because each translation option, stored in a node, is accessed through the offset obtained by searching the first part of the phrase table.

Each node in the second part must have the same size, so that all translation options of a phrase can be loaded based only on their number and the offset of the first one. Consequently, each node contains only the information needed for the decoding process.

At runtime, the decoding process needs the translation probability of a translation option and the first n−1 and last n−1 words of its target phrase (we use a trigram model, so n = 3). Figure 9 shows how the decoder calculates the score of a full sentence.

As a result, we store exactly this information in each node of the second part of the phrase table. Moreover, because each node must have the same size, we do not store the remaining words of the target phrase there. To trace back the full target phrase, we store all words of each target phrase in the third part of the phrase table; therefore, the last piece of information stored in the second part is the offset of the full target phrase in the third part.

Figure 9. Using phrase information during decoding process

Consequently, each node of the second part of the phrase table contains the following information:

• Translation probability value
• First n−1 words of the target phrase
• Last n−1 words of the target phrase
• Offset of the full target phrase.

Figure 10. First part and second part of phrase table
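A sketch of such a fixed-size translation-option node for the trigram case (n−1 = 2), again with illustrative field widths: the integer score field anticipates the quantization of section E, and the reordering scores of section F are omitted for brevity.

import struct

# assumed layout: integer translation score, first 2 and last 2 target word
# ids, and the offset of the full target phrase in the third part
OPTION = struct.Struct("<H 2I 2I Q")

def load_options(f, first_offset, count):
    """Load all translation options of a phrase: since every node has the
    same size, the offset and count from the first part are sufficient."""
    f.seek(first_offset)
    data = f.read(OPTION.size * count)
    return [OPTION.unpack_from(data, i * OPTION.size) for i in range(count)]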

The third part of the phrase table stores the full target phrases of all entries. Each node contains the number of words in the target phrase, followed by all the words of that phrase.

This 3-part structure of the phrase table helps the decoding system load all necessary information during the decoding process without slowing it down. Furthermore, based on this structure, we can apply some techniques to reduce the system's work during decoding; these techniques are described in section F.

D. Decoding algorithm

In the beam search algorithm, at each step of the search, the decoding system collects all phrases and all their translation options to create hypotheses, which are added to a hypothesis stack.

In our approach, we improve the speed of decoding by cutting off the collection of new hypotheses added to a hypothesis stack in each step. In other words, at each step, a smaller number of hypotheses are put into the hypothesis stack. This improvement significantly speeds up the decoder.

The question is how to cut down the collection of hypotheses added to a hypothesis stack. In the beam search process, a given phrase has many translation options. For example, “rice” in English can be translated to “lúa”, “gạo” or “cơm” in Vietnamese. In our approach, we pick out only the translation option that is most consistent with the already-translated part of the sentence to create a new hypothesis. The most consistent translation option is the one with the highest score when it is appended to the translated sentence part.

For example, we translate the English sentence: “I eat rice with fish” to Vietnamese. The decoding process comes to the step: “I” and “eat” are already translated; we need to select the next phrase for the next step. In this case, we choose “rice” which has 3 translation options, including “lúa”, “gạo” and “cơm”. We pick out the translation option “cơm”, which has the highest score when added to the translated sentence part, to create a new hypothesis included in the hypothesis stack, and eliminate “lúa” and “gạo” (“Tôi ăn cơm” has higher score than “Tôi ăn gạo” and “Tôi ăn lúa”). Figure 11 illustrates our example.

Figure 11. Hypothesis stack of original algorithm and our approach.
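A minimal sketch of this pruning: where standard beam search would push one new hypothesis per translation option, only the locally best option survives. The callable score_with, applying formula (3) incrementally to a partial translation, is a hypothetical stand-in.

def expand(hypothesis, options, score_with):
    """Expand a partial translation with one source phrase.

    Standard beam search would return len(options) new hypotheses; here only
    the option scoring best against the translated part so far is kept."""
    best = max(options, key=lambda option: score_with(hypothesis, option))
    return [(hypothesis, best)]      # a single new hypothesis instead of many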

The shortcoming of this approach is that the selected translation option has a locally maximal score, not a global one. In the example above, “Tôi ăn cơm” scores higher than “Tôi ăn lúa” and “Tôi ăn gạo” at this step, but if we continue to later steps, a target sentence beginning with “Tôi ăn lúa” or “Tôi ăn gạo” might end up with a higher score. Therefore, this change in the decoding algorithm may decrease translation performance. In the experiments section, we show how much the BLEU score drops when applying our approach.

The advantage of this approach is that a large number of hypotheses are eliminated during the decoding process, which significantly increases the speed of decoding.

E. Scoring

The score of a hypothesis is calculated by formula (3), which comes from the logarithm form of the probability formula (2). Therefore, the system needs to calculate scores using logarithm operations and floating-point numbers. These operations slow down the decoding process, especially on an embedded device. To solve this problem, we map the float scores to integer scores and exclude the logarithm operation from our decoding process.

To map scores from floats to integers, our approach follows the method described in [10], which maps each score to an integer ranging from 0 to 4095.
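A sketch of this quantization: log-domain scores are clipped and linearly rescaled onto a 12-bit integer range. The clipping bounds are our assumption; [10] describes the general mapping, not these exact constants.

def quantize_scores(log_scores, lo=-20.0, hi=0.0, levels=4096):
    """Map float log-probability scores to integers in [0, levels-1].

    Scores are clipped to [lo, hi] and rescaled linearly, so the decoder can
    sum small integers instead of floats (no floating-point unit needed)."""
    step = (hi - lo) / (levels - 1)
    return [round((min(max(s, lo), hi) - lo) / step) for s in log_scores]

For example, quantize_scores([0.0, -10.0, -20.0]) returns [4095, 2048, 0].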

The next problem is how to exclude the logarithm operation from our decoder. To do so, we pre-compute the score of each phrase and store it in the translation option node (in the second part of the phrase table described in section C). As a result, we no longer have to store the translation probability of the phrase in the translation option node: every translation option node stores only one integer, its translation score.

We apply the same method to our language model: all scores in the language model are mapped to integers ranging from 0 to 4095. The log-linear weights are folded into all scores as well, which means the decoder does not apply the weights at run time.

In summary, the score of each node in the phrase table is the sum of two scores:

• Translation score: the logarithm of the translation probability multiplied by the translation model weight ($\lambda_{\phi}$)

• N-gram score: the target phrase of a phrase node may be longer than n−1 words, which means we have to add its n-gram score to the score of the phrase node.

Because we do not use the weights during the decoding process, after mapping the scores of each n-gram model node to integers, we multiply them by the language model weight ($\lambda_{LM}$).

The reordering model is processed in the same way as the n-gram model: each score is mapped to an integer and multiplied by the reordering model weight ($\lambda_{d}$).

F. Other techniques

The special data structures of the n-gram model and phrase table described in sections B and C make it easy to apply some techniques that reduce the system's work during the decoding process and increase the decoder's speed. We describe these techniques in this section.

1) Pre-searching offsets of the last two words of a phrase

During the decoding process, a phrase connects with another phrase, and the value of the connection is the n-gram score of the last two words of the first phrase and the first two words of the second phrase. For the first phrase, we can pre-search the last two words in the n-gram model to get their offsets. Therefore, instead of storing the last two target words in a phrase option, the phrase option node stores the two offsets of the last two target words in the n-gram model, which speeds up n-gram node lookup while decoding.


For example, if the last two words of a target phrase are “ăn cơm”, instead of storing the word indices of “ăn” and “cơm” in the translation option node, we store the offset of the bigram “ăn cơm” and the offset of the unigram “cơm”.

2) Storing reordering scores

The reordering model contributes six values, for previous and following scores, and these six values must be stored as well. Using the phrase table structure described in section C, we store these six values together with the translation option score in the translation option node. Storing all six values is costly, but they are necessary for the decoding process.

3) Caching the n-gram model

Due to the memory limit, we cannot store the whole n-gram model in memory; however, we can keep a small part of it. During the decoding process, n-gram nodes are loaded from the n-gram model stored on hard disk; after loading them, we keep them in memory for future use. Because an n-gram node may be used more than once while translating, this caching reduces the cost of slow n-gram loading.
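A minimal sketch of such a cache as a bounded LRU map from disk offsets to nodes; the capacity, the read_node loader, and the LRU eviction policy are our assumptions, as the paper does not specify them.

from collections import OrderedDict

class NgramCache:
    """Keep recently used n-gram nodes in memory to avoid repeated disk reads."""
    def __init__(self, capacity=4096):
        self.capacity = capacity
        self.nodes = OrderedDict()            # offset -> node, in LRU order

    def get(self, f, offset, read_node):
        if offset in self.nodes:
            self.nodes.move_to_end(offset)    # mark as most recently used
            return self.nodes[offset]
        node = read_node(f, offset)           # slow path: read from hard disk
        self.nodes[offset] = node
        if len(self.nodes) > self.capacity:
            self.nodes.popitem(last=False)    # evict the least recently used
        return node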

4) Structure of the hypothesis stack

A hypothesis stack is a collection of hypotheses. Each time a hypothesis is added to the stack, the system checks whether the stack is full; if so, it eliminates the hypothesis with the lowest score. To improve the speed of searching, adding, and deleting hypotheses, the hypothesis stack uses a heap structure, in which the first element of the heap is the hypothesis with the lowest score.
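A sketch of such a bounded stack with Python's heapq: the heap root is always the weakest hypothesis, so a full stack can replace it in O(log n). The tie-break counter is an implementation detail we add so that hypotheses themselves never need comparing.

import heapq
import itertools

class HypothesisStack:
    """Bounded hypothesis collection; the heap root is the lowest-scoring entry."""
    def __init__(self, limit):
        self.limit = limit
        self.heap = []                        # (score, tiebreak, hypothesis)
        self.counter = itertools.count()

    def push(self, score, hypothesis):
        entry = (score, next(self.counter), hypothesis)
        if len(self.heap) < self.limit:
            heapq.heappush(self.heap, entry)
        elif score > self.heap[0][0]:         # better than the current worst
            heapq.heapreplace(self.heap, entry)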

5) Recombination, word penalty and phrase penalty

The technique of recombination, described in [3], helps eliminate bad search paths. For scoring, we use the word penalty and phrase penalty described in [4] to increase the quality of translation.

IV. EXPERIMENTS

We applied our approach to English-Vietnamese and Vietnamese-English translation systems. These systems require at most 512 KB of memory and up to 100 MB of hard disk, and the speed requirement is less than 0.1 second per sentence. These configurations are clearly suitable for embedded devices.

A. Corpus

The training corpus contains 21,643 English-Vietnamese pairs, in which the English sentences are 7.02 words long on average and the Vietnamese sentences 8.35 words. The testing corpus contains 20,000 English-Vietnamese pairs, in which the English sentences are 7.45 words long on average and the Vietnamese sentences 8.27 words. Tables I and II show the statistics of the training and testing corpora.

TABLE I. STATISTICS OF THE TRAINING CORPUS

                 English    Vietnamese
Word tokens      151,876    180,619
Word types       10,652     6,492
Sentences        21,643     21,643
Avg. Sent. Len.  7.02       8.35

TABLE II. STATISTICS OF THE TESTING CORPUS

                 English    Vietnamese
Word tokens      149,043    165,358
Word types       10,473     5,934
Sentences        20,000     20,000
Avg. Sent. Len.  7.45       8.27

In our system, the training and testing corpora are lowercased so that the extracted phrases are more consistent. To evaluate our approach, we compare our system with the MOSES decoder [5] using the BLEU score [13].

B. Results

Tables III and IV show that our system outperforms MOSES in terms of memory use and speed in both the English-Vietnamese and Vietnamese-English systems. More details are discussed below.

In the English-Vietnamese translation system, the total size of the original phrase table, n-gram model, and reordering model is 29,287,498 bytes. With our new structure, these data are reduced to 14,707,396 bytes, which is 50% smaller.

Our system uses 307.59 KB of memory on average, 40% below the requirement, while the MOSES system uses 72.2 MB of memory when translating 20,000 sentences. Our system's memory use is thus much smaller than that of MOSES.

The decoding process takes 1,040 seconds (an average of 0.052 second per sentence), while MOSES takes 4,380 seconds (an average of 0.219 second per sentence). The decoding speed of our system is therefore 76.26% faster than the MOSES system.

The BLEU score of our system is 37.8, which is 4% smaller than the BLEU score of the MOSES system (39.29). This is due to the changes in the decoding algorithm and the integer scoring described in sections D and E.

TABLE III. COMPARISON OF OUR ENGLISH-VIETNAMESE SYSTEM WITH MOSES AND THE REQUIREMENTS

               Our system   MOSES     Requirement
Training data  14 MB        27.9 MB   100 MB
Memory use     307.59 KB    72.2 MB   512 KB
Speed          0.052 s      0.219 s   0.1 s
BLEU           37.8         39.29     35.36

With the Vietnamese-English system, we observe similar results: our system is faster and uses less memory than MOSES. Please refer to Table IV for details.

TABLE IV. COMPARISON OF OUR VIETNAMESE-ENGLISH SYSTEM WITH MOSES AND THE REQUIREMENTS

               Our system   MOSES     Requirement
Training data  14.4 MB      27.8 MB   100 MB
Memory use     398.2 KB     81.3 MB   512 KB
Speed          0.091 s      0.196 s   0.1 s
BLEU           34.63        36.75     33.075


For mobile applications, our system is more flexible in its memory use: since we control the memory ourselves, we may, for example, store the whole word mapping in memory and use the remaining space for n-gram model caching.

V. CONCLUSION

Our approach aims to increase the speed of the decoding process under a memory limit, and our experiments show faster translation with a smaller memory requirement. Our system has been successfully deployed to several embedded devices, and the technique has been applied in mobile applications on the Android and iOS operating systems as an offline translation mode.

Translation performance may be reduced when applying our approach, but it remains acceptable for embedded systems and offline mobile applications.

In future work, we will apply more techniques to increase the speed and still maintain the translation quality.

REFERENCES

[1] P. F. Brown, J. Cocke, S. A. D. Pietra, V. J. D. Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin, “A statistical approach to machine translation,” Computational Linguistics, vol. 16, no. 2, pp. 79–85, June 1990.

[2] P. Koehn, F. J. Och, D. Marcu, “Statistical phrase-based translation,” In Proc. of HLT/NAACL 2003, Edmonton, Canada, May 27 - June 1 2003.

[3] P. Koehn, “Pharaoh: a beam search decoder for phrase-based statistical machine translation models,” In Proc. of the 6th Conference of the Association for Machine Translation in the Americas, Georgetown University, Washington DC, September 28 - October 2 2004.

[4] P. Koehn, “Statistical Machine Translation,” 1st Edition, ISBN-13 978-0-511-69132-4, Cambridge University Press, 2009.

[5] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, R. Zens, M. Federico, N. Bertoldi, C. Dyer, B. Cowan, W. Shen, C. Moran, O. Bojar, A. Constantin, E. Herbst, “Moses: Open Source Toolkit for Statistical Machine Translation,” In Proc. of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, 2007.

[6] F. J. Och, C. Tillmann, H. Ney, “Improved alignment models for statistical machine translation,” In Proc. of the Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20-28, University of Maryland, College Park, MD, June 1999.

[7] F. J. Och and H. Ney, “Discriminative training and maximum entropy models for statistical machine translation,” In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, pages 295-302, Philadelphia, PA, July 2002.

[8] F. J. Och, “Minimum classification error training for statistical machine translation,” In Proc. of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, July 2003.

[9] D. Wu, “Stochastic inversion transduction grammars and bilingual parsing of parallel corpora,” Computational Linguistics, vol. 23, no. 3, September 1997.

[10] Y. Zhang, S. Vogel, “PanDoRA: A Large-scale Two-way Statistical Machine Translation System for Hand-held Devices,” In Proc. of MT Summit XI, Copenhagen, Denmark, Sep. 10-14 2007.

[11] B. Zhou, S. F. Chen, Y. Gao, “FOLSOM: A fast and memory-efficient phrase-based approach to statistical machine translation,” In Proc. of the IEEE Spoken Language Technology Workshop, 10-13 Dec. 2006.

[12] A. M. Finch, W. Song, K. Tanaka-Ishii, E. Sumita, “picoTrans: Using Pictures as Input for Machine Translation on Mobile Devices”, AAAI Publications, Twenty-Second International Joint Conference on Artificial Intelligence, 2011.

[13] K. A. Papineni, S. Roukos, T. Ward, W. J. Zhu, “BLEU: a method for automatic evaluation of machine translation,” In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.

[14] A. Waibel, A. Badran, A. W. Black, R. Frederking, D. Gates, A. Lavie, L. Levin, K. Lenzo, L. M. Tomokiyo, J. Reichert, T. Schultz, D. Wallace, M. Woszczyna, J. Zhang, “Speechalator: two-way speech-to-speech translation on a consumer PDA”. In Proc. of Eurospeech 2003, pages 369-372, Geneva, Switzerland, September 1-4 2003.

[15] K. Yamabana, K. Hanazawa, R. Isotani, S. Osada, A. Okumura, T. Watanabe, “A speech translation system with mobile wireless clients,” In the Companion Volume to the Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 133-136, Sapporo, Japan, July 2003.
