A Chinese-English Patent Machine Translation System Based on the Theory of Hierarchical Network of...



Page 1: A Chinese-English Patent Machine Translation System Based on the Theory of Hierarchical Network of Concepts

October 2012, 19(Suppl. 2): 140–146 www.sciencedirect.com/science/journal/10058885 http://jcupt.xsw.bupt.cn

The Journal of China Universities of Posts and Telecommunications

A Chinese-English patent machine translation system based on the theory of hierarchical network of concepts

ZHU Yun, JIN Yao-hong

CPIC-BNU Joint Laboratory of Machine Translation, Institute of Chinese Information Processing, Beijing Normal University, Beijing 100875, China

Abstract

Compared with ordinary text, patent text often has more complex sentence structures and more ambiguity arising from multiple verbs. To deal with these problems, this paper presents a rule-based Chinese-English patent machine translation (MT) system based on the theory of hierarchical network of concepts (HNC). In this system, the whole procedure is divided into three main parts: the semantic analysis of the source language, the transitional transformation from the source language to the target language, and the generation of the target language. The knowledge base and the rule set are obtained by manually analyzing the semantic features of a training set that contains more than 6 000 Chinese patent sentences, and a specific evaluation method is provided for the experiment.

Keywords patent machine translation, semantic analysis, semantic features, transitional transformation

1 Introduction

To facilitate worldwide communication and cooperation, patent literature often needs to be translated into multiple languages. However, the volume of patent applications is huge and keeps growing, which puts increasingly heavy pressure on translation. In response to this pressure, the automatic translation of patent literature has become an important application area of MT. In 2008, NTCIR-7 listed patent MT as one of its evaluation tasks, and since then this task has been run repeatedly [1].

Patent language, as a combination of technical language and legal language, has some distinctive features. Generally, patent literature has longer sentences, tedious and rigorous expressions, and a fixed format. All these features make MT more difficult. Many existing patent MT systems are developed from ordinary MT systems, and researchers have not found a special approach to handle the difficulties of patent translation. As a result, their performance is lower than that of typical text translation. Although the characteristics of patent text bring such difficulties to translation, the unique way of expression can provide a great deal of information for semantic analysis.

This paper does not use a statistical method. Instead, we focus on patent text. By studying these features of expression and the rules for translating patents from Chinese to English, which we obtain from a Chinese-English bilingual patent corpus, we present a rule-based system that translates from the source language into the target language in three main steps. The first step is the semantic analysis of the source language, after which the system obtains the shallow parsing tree. The second step is the transitional transformation from the source language to the target language, during which some nodes of the parsing tree are moved. The last step is the generation of the target language, in which the system gives the translation of every word based on its part-of-speech. The system is designed to work in this way in order to improve translation performance.

Received date: 05-08-2012
Corresponding author: ZHU Yun, E-mail: [email protected]
DOI: 10.1016/S1005-8885(11)60430-5

This paper is organized as follows. Sect. 2 reviews related work. Sect. 3 discusses the translation strategy. Sect. 4 introduces the algorithm and procedure. Sect. 5 presents the experimental results and analysis. The last section presents our conclusions.

2 Related work

At present, MT systems usually adopt one of the following methods: the rule-based method, the statistical method, or the hybrid-strategy method. The rule-based method has the advantage that rules can describe language phenomena directly and accurately. However, writing rules takes a lot of time and manpower, and completeness and consistency of the rule set are difficult to achieve. The statistical method, on the other hand, can obtain results quickly. But because of data sparseness and other reasons, it lacks the capacity to deal with some particular language phenomena. Also, a large-scale bilingual corpus is not easy to obtain [2]. Despite these deficiencies, the statistical machine translation (SMT) method is widely used. According to the data provided by the organizers, 12 of the 15 groups in the patent translation task of the NTCIR-7 workshop used a statistical method [1]. To compensate for the shortcomings of both methods, a hybrid-strategy method that combines these two MT methods has become a new trend.

For recognizing the main verb in a Chinese sentence, recent research has adopted the following approaches. Some recognize the predicate head by constructing a statistical decision tree model [3]. Some recognize the predicate head according to the predicate head of the corresponding English sentence [4]. Some combine a rule-based method with a multi-feature-based method [5]. Some obtain the head verb candidates through the conceptual symbols of Chinese characters or words [6]. Some identify the predicate head based not only on the static and dynamic grammatical features of the candidate predicate heads, but also on the syntactic relations between the subject and the predicate [7].

For chunking and shallow parsing, the support vector machine (SVM) has been applied to shallow parsing [8]. Others present a memory-based learning approach [9]. In Ref. [10], the author applies shallow parsing to Portuguese–Spanish MT.

To obtain dependency trees for different languages, former researchers mainly adopted statistical methods. Some formalized weighted dependency parsing as searching for maximum spanning trees (MSTs) in directed graphs [11]. Some extract typed dependency parses of English sentences from phrase structure parses [12]. For Japanese, researchers analyze dependency structure based on SVMs [13] and a cascaded chunking model [14]. Zhou uses a hybrid model of phrase structure partial parsing and dependency parsing [15].

Chinese-English MT usually needs structural reordering. Researchers have presented a source-side reordering method based on syntactic chunks for phrase-based statistical MT [16]. Wang et al. proposed a reordering method based on a set of syntactic reordering rules [17].

Our research is based on the HNC theory [18]. This theory models the space of language concepts as a system of four-level digital symbols, and each level has its own symbolic expressions. The theory proposes a method for semantically analyzing Chinese texts grounded in the characteristics of Chinese expression, which provides us with both the telescope and the microscope for observing natural language.

3 The strategy of translation based on HNC theory

In this section, we would like to present the main task of every step and some semantic features used in our system.

Note that the HNC theory is not only a sound theoretical framework for analyzing the problem. More importantly, it also provides a unified symbolic representation of possible rules for solving our problem, which makes the software implementation of the rule-based natural language understanding (NLU) problem readily feasible. In the following examples, some important HNC principles are applied to form the rules. The main HNC principle used in this paper is the lv principle, one of the strategies for sentence analysis based on the HNC theory [19–20]. In this principle, l stands for logic concepts, which can be divided into many types, and these types provide different kinds of information in different phases of recognition; v stands for verbs, which themselves have many semantic features.

We define three kinds of chunks, the entity chunk, the adverbial chunk and the predicate, to represent a sentence in the shallow parsing tree. The entity chunk includes the subject, the object and the object complement. The adverbial chunk modifies the predicate, and it may describe the time, the location, the manner, etc. of the action.


3.1 Semantic features

There are several lv concepts and their semantic features used in our system, the most important of which are introduced in this section.

1) lb

There is a certain kind of logic concept that indicates small sentence (SS) separation. In our system, we call this kind of logic concept lb. It resembles the conjunctions that connect sentences or clauses.

2) l0

A predicate often has a subject and an object. Usually, the order of a sentence is 'subject+predicate+object'. In this case, the entity chunks are separated by the predicate. There is a kind of logic concept, l0, which is located between two entity chunks. This kind of concept is also sometimes required to change the sentence order, as in 'object+l0+subject+predicate (in passive voice)'. The logic concepts l0 can be divided into two groups: one is followed by the subject and the other is followed by the object.

3) l1 & l1h

The adverbial chunk has its own signal logic concepts. The logic concept l1 always appears at the beginning of an adverbial chunk. Sometimes, at the end of the adverbial chunk, there is a signal logic concept that indicates the boundary. We call this kind of logic concept l1h.

4) Verb

Verbs have particular semantic features. Some verbs can be followed by a clause as the object. Some verbs need to be changed into the passive voice when translated into English. Some verbs can have only two entity chunks, while others can have three. Each verb is given a score based on its own semantic features and the logic concepts it is related to.

3.2 Semantic analysis

In this step, the shallow parsing tree is obtained. This semantic analysis is based on the principle of boundary perception. The logic concepts and the verbs are used to divide a sentence into several chunks.

There are two key problems in Chinese sentence analysis. The first is that there may be several predicates in one Chinese sentence. As a result, one Chinese sentence should be separated into several parts according to the number of predicates. Each part is called an SS, which can form an independent syntactic tree. Generally, an SS is composed of several words and exactly one of the following punctuation marks (called stop punctuation in this paper): comma, colon, semicolon, period, question mark, and exclamation mark. However, an SS can sometimes have more than one comma, and a part of a sentence ending with a comma can sometimes be separated into more than one SS. The second problem is that a Chinese sentence may have more ambiguity arising from multiple verbs. Not every verb that occurs in a sentence can divide the sentence; only the predicate can. Thus, a process of predicate recognition is needed in our system.

With the results of SS separation and predicate recognition, an SS can be divided by the appropriate logic concepts and verbs. Then, a shallow parsing tree can be generated from the division.

1) SS separation

The process of SS separation mainly relies on the stop punctuation mark and the logic concept lb.

Pattern 1 lb+word sequence+stop punctuation

In Pattern 1, the logic concept lb is followed by several words and a stop punctuation mark. It signals the beginning of an SS.

Pattern 2 word sequence+lb+word sequence+stop punctuation

Sometimes, the logic concept lb occurs in the middle of a sentence. However, both parts of the sentence before and after this word have predicates. As a result, this concept divides this whole sentence into two SSs.

SS separation is not based only on punctuation and the logic concept lb. A reasonable SS requires a predicate. Thus, if a division of a sentence does not have a predicate, it cannot be separated as an SS.
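The separation rules above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the system's implementation: the token tags (`"lb"`, `"v"`, `"w"`, `"punct"`), the function names, and the ASCII punctuation set are hypothetical, and "has a predicate" is approximated by "contains a verb", whereas the real system scores verbs to find the predicate.

```python
# Hypothetical sketch of SS separation: cut at stop punctuation and before lb,
# then merge any piece that has no predicate into the following piece.
STOP_PUNCT = {",", ":", ";", ".", "?", "!"}  # ASCII stand-ins for the stop punctuation


def has_predicate(tokens):
    # Simplifying assumption: any token tagged "v" counts as a predicate.
    return any(tag == "v" for _, tag in tokens)


def split_ss(tokens):
    """tokens: list of (word, tag) pairs. Returns a list of SSs (token lists)."""
    pieces, current = [], []
    for word, tag in tokens:
        if tag == "lb" and current:      # an lb marks the beginning of a new piece
            pieces.append(current)
            current = []
        current.append((word, tag))
        if word in STOP_PUNCT:           # stop punctuation closes the piece
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)

    result, pending = [], []
    for piece in pieces:
        pending = pending + piece
        if has_predicate(pending):       # a reasonable SS requires a predicate
            result.append(pending)
            pending = []
    if pending:                          # trailing piece without a predicate
        if result:
            result[-1] = result[-1] + pending
        else:
            result.append(pending)
    return result
```

With this sketch, an lb in the middle of a sentence splits it only when both sides contain a verb, which mirrors Pattern 2.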

2) Predicate recognition

The semantic features that play an important role in predicate recognition can be categorized into two groups. In one group, the semantic features raise the possibility that a verb is the predicate. The semantic features in the other group, on the contrary, reduce this possibility.

The semantic features in the former group can be further divided into three types. Some logic concepts or verbs combine with verbs to form a predicate candidate chunk. These logic concepts may change the tense, the voice and the aspect of the head verb. Because of these concepts, the verb they modify is in its finite form, which raises its possibility of being selected as the predicate of the SS. The predicate chunk can also have a "verb+verb" structure. In this kind of structure, the secondary verb can also indicate the tense or the aspect of the head verb, or the combination can indicate the concatenation of two actions. Thus, a predicate candidate chunk with a complex composition is more likely to be selected as the predicate.

The occurrence of some logic concepts changes the order of an SS. Some logic concepts indicate the beginning and the end of an adverbial chunk. This adverbial chunk modifies the following verb, and it moves to the end of the sentence when a Chinese sentence is translated into English. In addition, the logic concept l0 also changes the order of the entity chunks.

Pattern 3 Entity Chunk1+l0+Entity Chunk2+Predicate+ Entity Chunk3

Pattern 4 Entity Chunk1+Predicate+Entity Chunk2+ Entity Chunk3

The chunk sequence in Pattern 3 may change into Pattern 4 after the translation. So we can infer that because the sentence sequence has changed, the possibility for the verbs related to these logic concepts to be the predicate candidates is increased.

Two kinds of verbs have a high possibility of being the main verb. The first kind is verbs that are always followed by a clause as the object. The second kind is verbs that are always used as the main verb in patent text, e.g. "relate to/include/disclose/provide".

However, not all the semantic features we paid attention to have positive influence on the related verb in increasing the possibility to be the predicate. On the contrary, some semantic features could decrease this possibility.

In our system, there is a kind of logic concepts, such as “this/some of”, represented by lu9. This kind of concepts indicates that the following is a noun or a noun phrase. This decreases the possibility for a verb to be the predicate, when it occurs after a lu9 concept.

The concept lu9 is not the only feature that rules out predicate candidates. The Chinese character 'de' is generally used between things and their modifiers. As a result, if a verb occurs before 'de', it always belongs to the modification or restriction. On the other hand, if a word occurs after 'de', this implies that the word is used as a noun.
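The interplay of raising and lowering features can be illustrated with a small scoring sketch. Everything concrete here is an assumption made for illustration: the tag names, the `Token` fields, the numeric weights, and the rule that a "reasonable" score means a score above zero are hypothetical and are not taken from the system's rule set.

```python
# Hypothetical scoring sketch for predicate recognition.
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    tag: str                           # assumed tags: "v", "l", "lu9", "de", "w"
    takes_clause_object: bool = False  # verb that is followed by a clause as object
    typical_patent_verb: bool = False  # e.g. relate to / include / disclose / provide


def score_verb(tokens, i):
    tok = tokens[i]
    prev_tag = tokens[i - 1].tag if i > 0 else None
    next_tag = tokens[i + 1].tag if i + 1 < len(tokens) else None
    score = 0
    if "l" in (prev_tag, next_tag):    # tense/voice/aspect marker next to the verb
        score += 2
    if tok.takes_clause_object:
        score += 2
    if tok.typical_patent_verb:
        score += 3
    if prev_tag == "lu9":              # lu9 announces a noun or noun phrase
        score -= 4
    if "de" in (prev_tag, next_tag):   # verb inside a modifier, or used as a noun
        score -= 4
    return score


def pick_predicate(tokens):
    """Return the index of the best-scoring verb, or None if no score is positive."""
    candidates = [(score_verb(tokens, i), i)
                  for i, t in enumerate(tokens) if t.tag == "v"]
    if not candidates:
        return None
    best_score, best_index = max(candidates, key=lambda c: (c[0], -c[1]))
    return best_index if best_score > 0 else None
```

For a gloss sequence such as invention / disclose / ASPECT / a-kind-of (lu9) / detect / de / method, the verb "disclose" receives a positive score while "detect" is pushed down by both lu9 and 'de', so "disclose" is selected.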

3) Results of analysis

After the semantic analysis, the shallow parsing tree is obtained. The different nodes each have their own specific composition.

Each sentence (CS) ending with a full stop is the root of the tree, and it has several SS nodes and separators SST (lb or stop punctuation).

CS=SS+SST

An SS node is composed of at least one and at most three entity chunks and one predicate. The adverbial chunk and the entity chunk separator l0 are optional (marked [alternative] below).

SS=Adverbial Chunk [alternative] + Entity Chunk 1 [alternative] + Predicate + Entity Chunk 2 + Entity Chunk 3 [alternative]

If there is a logic concept l0 in the sentence, the structure of the SS changes into the following one.

SS=Adverbial Chunk [alternative] + Entity Chunk 1 [alternative] + l0 + Entity Chunk 2 + Predicate + Entity Chunk 3 [alternative]

As child nodes of the SS, the predicate and the adverbial chunk may unfold to the next level.

Adverbial Chunk=l1+words sequence+l1h[alternative]

Predicate=l/v[alternative]+head verb+l/v[alternative]
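The two SS structures given above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical one-letter encoding of chunk types (A for adverbial chunk, E for entity chunk, L for l0, P for predicate); the function and constant names are illustrative only.

```python
# Hypothetical validity check for the chunk sequence of an SS.
import re

CODES = {"adverbial": "A", "entity": "E", "l0": "L", "predicate": "P"}

# SS = Adverbial[opt] + Entity1[opt] + Predicate + Entity2 + Entity3[opt]
SS_PLAIN = re.compile(r"A?E?PEE?")
# SS = Adverbial[opt] + Entity1[opt] + l0 + Entity2 + Predicate + Entity3[opt]
SS_WITH_L0 = re.compile(r"A?E?LEPE?")


def is_valid_ss(chunk_types):
    """chunk_types: sequence of 'adverbial', 'entity', 'l0', 'predicate'."""
    encoded = "".join(CODES[c] for c in chunk_types)
    return bool(SS_PLAIN.fullmatch(encoded) or SS_WITH_L0.fullmatch(encoded))
```

Encoding the structure as a regular expression keeps the optional parts explicit, so a sequence such as entity + predicate is rejected because Entity Chunk 2 is not optional in the structure without l0.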

3.3 Sentence transformation

The main task of transitional transformation is to transform the original parsing tree into its legitimate form in target language. It mainly contains three kinds of operation. The first kind is to add or delete a node of the parsing tree, the second one is to change the position of a node in the tree, and the last one is to change some attribute values of a node.

1) Relation between SSs

As mentioned before, a Chinese sentence may have more than one SS. However, Chinese is a language that lacks inflection. Thus, if a Chinese sentence with multiple SSs cannot be translated into a compound sentence, the system needs to decide which SS is the independent clause and which types of dependent clause the other SSs are. Moreover, some SSs are changed into infinitive forms. To decide which form an SS should be transformed into, some logic concepts and the sharing relation are required. For example, an SS beginning with a coordinating conjunction or a correlative conjunction should adopt the form of an independent clause.

Pattern 5 Entity Chunk 1+Predicate 1+Entity Chunk 2+‘,’+Predicate 2+Entity Chunk 3+‘.’

In addition, the sharing relation helps to define the relation between SSs. Take Pattern 5 as an example: two SSs connected in this pattern usually indicate that Entity Chunk 2 is the subject of Predicate 2, which is omitted to avoid duplication. As a result, the second SS can be converted into an attributive clause.

2) Reorder the sequence of entity chunks

In Sect. 3.1, a kind of logic concept, l0, was introduced. With different l0, the order of chunks can change in different ways when we translate them into English. Some l0 are used before the subject. In this situation, the Chinese sentence always follows the pattern below.

Pattern 6 Entity Chunk 1(Object)+l0+Entity Chunk 2(Subject)+Predicate+Entity Chunk 3(Object 2/Object Complement[alternative])

To translate this kind of sentence into English, the order of chunks should be changed into Pattern 7 or Pattern 8.

Pattern 7 Entity Chunk 2(Subject)+Predicate (in active voice)+Entity Chunk 1(Object)+Entity Chunk 3(Object 2/Object Complement[alternative])

Pattern 8 Entity Chunk 1(Object)+Predicate(in passive voice)+Entity Chunk 3(Object 2/Object Complement[alternative])+‘by’+Entity Chunk 2(Subject)

Also, the l0 used before the object needs transitional transformation.

Pattern 9 Entity Chunk 1(Subject)+l0+Entity Chunk 2(Object)+Predicate+Entity Chunk 3(Object 2/Object Complement[alternative])

To translate a sentence in Pattern 9 into English, the order of chunks should be changed into Pattern 10.

Pattern 10 Entity Chunk 1(Subject)+Predicate(in active voice)+Entity Chunk 2(Object)+Entity Chunk 3(Object 2/Object Complement[alternative])
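The reordering from Patterns 6 and 9 into their English counterparts can be sketched as a small function. This is an illustration under assumptions: chunks are plain strings, the `l0_kind` values and the `[active]`/`[passive]` suffixes are hypothetical markers, and the choice between Pattern 7 and Pattern 8 is passed in by the caller rather than derived from verb features as in the system.

```python
# Hypothetical sketch of entity chunk reordering around l0.
def reorder_entity_chunks(chunks, l0_kind, passive=False):
    """chunks: [entity1, l0, entity2, predicate] plus an optional entity3.

    l0_kind is 'before_subject' (Pattern 6) or 'before_object' (Pattern 9).
    """
    entity1, _l0, entity2, predicate = chunks[:4]
    rest = list(chunks[4:])                     # optional Entity Chunk 3
    if l0_kind == "before_subject":
        if passive:                             # Pattern 8
            return [entity1, predicate + "[passive]", *rest, "by", entity2]
        # Pattern 7: subject first, then the fronted object
        return [entity2, predicate + "[active]", entity1, *rest]
    if l0_kind == "before_object":              # Pattern 9: object follows the predicate
        return [entity1, predicate + "[active]", entity2, *rest]
    raise ValueError(f"unknown l0 kind: {l0_kind}")
```

In every case the l0 node itself is dropped from the output, since its role is expressed in English by word order or by the passive construction with 'by'.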

3) Adverbial chunks moving

In a Chinese sentence, an adverbial chunk always occurs before the predicate. In English, however, the adverbial clause or prepositional phrase can be placed either at the beginning or at the end of a sentence. Hence, in the step of transitional transformation, if the adverbial chunk is at the beginning of the sentence and is separated by a comma, its position in the parsing tree does not change. But if it appears in the middle of a sentence, it is moved to the last node of the sentence's parsing tree.

Also, an adverbial chunk with both l1 and l1h needs an extra operation. Both logic concepts should be removed from the parsing tree, and a new node with a proper conjunction or preposition should be added.
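The position rule for adverbial chunks can be illustrated as follows. This is a minimal sketch with hypothetical names: nodes are `(type, text)` pairs for the children of one SS (the SST separator is assumed to sit outside this list), and leaving a sentence-initial adverbial in place even without a following comma is an assumption, since the text only specifies the comma-separated case.

```python
# Hypothetical sketch: move mid-sentence adverbial chunks to the end of the SS.
def move_adverbials(nodes):
    """nodes: list of (node_type, text) children of one SS, in source order."""
    kept, moved = [], []
    for index, node in enumerate(nodes):
        node_type = node[0]
        if node_type == "adverbial" and index > 0:
            moved.append(node)      # mid-sentence adverbial goes to the end
        else:
            kept.append(node)       # everything else keeps its relative order
    return kept + moved
```

The replacement of l1 and l1h by a single conjunction or preposition node is not covered here, because it depends on dictionary knowledge that the text does not spell out.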

3.4 Generation

The main task in the generation step is to determine the part-of-speech (POS) of each word and to transform each word into the appropriate form. To determine the POS of an ambiguous word, our system mainly looks at the POS of the previous word or the next word, based on the parsing tree after transitional transformation.

4 Algorithm of the translation

The architecture of the system is shown in Fig. 1.

Fig. 1 The system architecture

Our system is a rule-based system, and it contains four modules, word segmentation, semantic analysis, transitional transformation and generation. All these modules need the support of the knowledge base and rule sets. In the following part, all these modules except word segmentation are discussed.

4.1 Semantic Analysis

Step 1 After word segmentation, pre-segment all the sentences into SSs based on lb and the stop punctuation mark.

Step 2 Use l1, l1h and l0 to divide the SS.

Step 3 Give every verb in the SS its score. Then sort all the verbs in descending order according to their score. Select the one with the highest reasonable score as the predicate.

Step 4 If no reasonable predicate can be found in a pre-judged SS, combine the division with the following SS.

Step 5 Divide the SS into several chunks based on l1, l1h, l0 and the predicate, and confirm every chunk’s type.

After these five steps, a shallow parsing tree of an SS is obtained. A sentence with multiple SSs forms a parsing forest.


4.2 Sentence transformation

Step 1 Determine the relations between SSs with lb and the sharing relations. Confirm which ones should be independent clauses and which ones should be dependent clauses. Add the proper conjunctions or delete the redundant ones.

Step 2 Reorder the sequence of the entity chunks and the predicate in an SS with l0. Modify the voice of the predicate.

Step 3 Move the adverbial chunks to the proper position. Delete l1 and l1h and select reasonable conjunctions or prepositions.

Step 4 Modify the tense, the voice and the aspect of the verb in the predicate based on the logic concepts and verbs it related to.

4.3 Generation

Step 1 Determine the part-of-speech of each word with ambiguity according to the previous or the next word.

Step 2 Select the word with confirmed POS from the dictionary and transform it into the correct form.

5 Experiment and result analysis

The experiment uses 277 bilingual sentences as the main training set; these were selected by experts of the State Intellectual Property Office of the People’s Republic of China (SIPO) from the patent corpus as sentences that are prone to translation mistakes. More than 6 000 sentences extracted from 30 complete Chinese patents serve as the secondary training set, which is used to build the dictionary and to supplement the rule set. The closed test is based on the main training set and the open test is based on the evaluation set.

The first step is to manually establish a knowledge base for all the words that appear in the main and secondary training sets; it contains about 11 300 words and their semantic features. The second step is to summarize the rules for the different steps. By analyzing the semantic features of logic concepts and verbs in the main training set, we summarized rules and obtained rule sets containing almost 700 rules. Then we applied the human knowledge mentioned above to the system described in Sect. 4.

One important characteristic we emphasize in our system is the semantic analysis and the transitional transformation of the source language. Thus, we evaluate the key processes instead of using the BLEU score [21], which, in our opinion, mainly evaluates word selection and cannot reflect whether the sentence structure is correct. We select SS segmentation, predicate recognition, entity chunk reordering and adverbial chunk reordering as the four test points of our system and manually calculate the precision (P) of each. The results of the closed test are shown in Table 1.

Table 1 Results of closed test set

                  Semantic analysis                                       Transitional transformation
Total sentences   P of SS segmentation/%   P of predicate recognition/%   P of entity chunk reordering/%   P of adverbial chunk reordering/%
277               92                       94                             80                               90

To emphasize the importance of predicate recognition in semantic analysis and to provide a comparison, we also sent these 277 sentences to Google online translation. The data for Google were obtained by analyzing its translation results. All the test results are shown in Table 2.

Table 2 Compared test results of closed test set

         Total   Detected   Correct   P/%    R/%    F/%
OURS     363     354        333       94.1   91.7   92.9
GOOGLE   363     288        217       75.3   59.8   66.7

In Table 2, OURS is the result obtained from our system, and GOOGLE is the one from Google online translation. From this table, we can see that the result from Google is much lower than the one from our system. Note that our system gives intermediate results, whereas Google’s results are inferred from its translation output. Two kinds of errors mainly occur in Google’s results. The first is that multiple verbs bring ambiguity to predicate recognition and Google translation selects the wrong verb as the predicate, which mostly affects the precision. The other is that the translation result often has no predicate because the correct predicate head is in a non-finite form, which mainly affects the recall. From this comparison, we can see that our method of identifying the predicate chunk performed well in the closed test.
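The P, R and F values in Table 2 are consistent with the usual definitions if Total is read as the number of reference predicates, Detected as the number of predicates output, and Correct as the number of correct ones. That reading of the columns is an assumption, as is the function name in the short check below.

```python
# Hypothetical helper: precision, recall and F-measure in percent, one decimal.
def precision_recall_f(total, detected, correct):
    precision = correct / detected
    recall = correct / total
    f_measure = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * value, 1) for value in (precision, recall, f_measure))


print(precision_recall_f(363, 354, 333))  # OURS row:   (94.1, 91.7, 92.9)
print(precision_recall_f(363, 288, 217))  # GOOGLE row: (75.3, 59.8, 66.7)
```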

After the closed test, we carried out the open test on the evaluation set. The test results are shown in Table 3.


Table 3 Results of open test set

                            Semantic analysis                                       Transitional transformation
          Total sentences   P of SS segmentation/%   P of predicate recognition/%   P of SS segmentation/%   P of predicate recognition/%
Best      267               96                       91                             94                       96
Worst     252               68                       46                             43                       80
Average   6 308             83                       75                             74                       85

From Table 3, we can see that, for the best result among the 30 patents, the precision at every test point is over 90%. The test results suggest that our method can improve the accuracy of parsing and thus, based on a correct syntactic tree, the performance of the MT system can be improved. We continued to supplement and amend the rule set during the open test.

By analyzing the experimental results, we found that, because of the purely rule-based method, the system depends heavily on the completeness of the rules and the accuracy of the knowledge base. The rules we have obtained so far are not sufficient, and we need to add complementary rules for special language phenomena to improve the performance. The system therefore still has room for improvement. We believe that this method can be put into practical application after we complete the rules and the knowledge base.

6 Conclusions

In this paper, we present a rule-based system for Chinese-English patent machine translation. Satisfactory results were obtained in both the closed test and the open test. We propose this method to improve the performance of Chinese-English patent MT on the basis of the syntactic tree that our analysis provides.

In the future, further work should be undertaken to complete the rules so as to enhance the performance of semantic analysis and transitional transformation, and the entity chunks need to be unfolded to obtain more accurate analysis results.

Acknowledgements

This work was supported by the Hi-Tech Research and Development Program of China (2012AA011104) and the Fundamental Research Funds for the Central Universities.

References

1. Fujii A, Utiyama M, Yamamoto M, et al. Overview of the patent translation task at the NTCIR-7 workshop. NTCIR-7 Workshop Meeting, Japan, 2008

2. Dai X, Yin C, Chen J, et al. Machine translation: past, present, future. Computer Science, 2004, 31(11): 176–179, 184 (in Chinese)

3. Sui Z, Wen S. The acquisition and application of the knowledge for recognizing the predicate head of a Chinese simple sentence. Acta Scientiarum Naturalium Universitatis Pekinensis, 1998, 34(2–3): 221–230 (in Chinese)

4. Sui Z, Wen S. The research on recognizing the predicate head of a Chinese sentence in EBMT. Journal of Chinese Information Processing, 1998, 12(4): 39–46 (in Chinese)

5. Gong X, Luo Z. Recognizing the predicate head of Chinese sentences. Journal of Chinese Information Processing, 2003, 17(2): 7–13 (in Chinese)

6. Wei Z, Xiong L, Zhang Q. Research on automatic acquiring head verb of Chinese sentences. Computer Engineering and Applications, 2007, 43(10): 179–182 (in Chinese)

7. Li G, Meng J. A method of identifying the predicate head based on the correspondence between the subject and the predicate. Journal of Chinese Information Processing, 2005, 19(1): 1–7, 41 (in Chinese)

8. Kudo T, Matsumoto Y. Chunking with support vector machine. The Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, 2001: 1–8

9. Daelemans W, Buchholz S, Veenstra J. Memory-based shallow parsing. EACL'99 Workshop on Computational Natural Language Learning (CoNLL-99), 1999: 53–60

10. Garrido A A, Gilabert Z P, Pérez-Ortiz J A, et al. Shallow parsing for Portuguese-Spanish machine translation. Tagging and Shallow Processing of Portuguese: workshop notes of TASHA'2003, 2004: 21–24

11. McDonald R, Pereira F, Ribarov K, et al. Non-projective dependency parsing using spanning tree algorithms. Conference on Human Language Technology and Empirical Methods in Natural Language Processing, 2005: 523–530

12. Tapanainen P, Järvinen T. A non-projective dependency parser. The Fifth Conference on Applied Natural Language Processing, 1997: 64–71

13. Kudo T, Matsumoto Y. Japanese dependency structure analysis based on support vector machine. SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 2000, 13: 18–25

14. Kudo T, Matsumoto Y. Japanese dependency structure analysis using cascaded chunking. 6th Conference on Natural Language Learning, 2002, 20: 1–7

15. Zhou M. A block-based robust dependency parser for unrestricted Chinese text. The Second Workshop on Chinese Language Processing, 2000, 12: 78–84

16. Zhang Y, Zens R, Ney H. Chunk-level reordering of source language sentences with automatically learned rules for statistical machine translation. SSST, NAACL-HLT 2007/AMTA Workshop on Syntax and Structure in Statistical Translation, 2007: 1–8

17. Wang C, Collins M, Koehn P. Chinese syntactic reordering for statistical machine translation. 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2007: 737–745

18. Huang Z. HNC (hierarchical network of concepts) theory. CN: Tsinghua University Press, 1998 (in Chinese)

19. Jin Y. Natural language understanding based on the theory of HNC (hierarchical network of concepts). CN: Science Press, 2005 (in Chinese)

20. Jin Y. A hybrid-strategy method combining semantic analysis with rule-based MT for patent machine translation. 2010 International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE), 2010: 1–4

21. Papineni K, Roukos S, Ward T, et al. BLEU: a method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM Research Report