Machine Translation activities at WIPO Bruno Pouliquen, Christophe Mazenc Patentscope workshop June 2013

Machine Translation activities at WIPO

Bruno Pouliquen, Christophe Mazenc

Patentscope workshop June 2013

Agenda

1. History of machine translation activities at WIPO

2. Cross Lingual Search

3. Integration of third party MT engines

4. Development of in-house MT engines tuned for specific tasks

5. Strategy

1. History of MT activities at WIPO

MT at WIPO: history

Why is WIPO interested in Machine Translation?

The IB of the PCT is responsible for translating titles, abstracts, drawing legends, search reports, written opinions and IPRPs for the published PCT applications.

(This represents xx million words translated per year)

WIPO disseminates multilingual patent information through its portal, PATENTSCOPE. Multilingual functions are required to enable the largest number of users worldwide to search and browse patent applications in many different languages.

MT at WIPO: an overview

Mid 2007: International RFP to implement “cross lingual Search” functions in PATENTSCOPE

End of 2008: project failure due to the supplier’s inability to deliver a quality product

2009: First Statistical Machine Translation experiments performed in-house. Development of a first engine to translate titles from English to French

2009-2010: development of the PATENTSCOPE CLIR system in 5 languages (EN, FR, DE, ES, JA)

MT at WIPO: an overview

Summer 2010: Integration of Google Translate in PATENTSCOPE to translate result lists, descriptions and claims

March 2011: development and deployment of WIPO’s first own MT system tuned for patents’ titles and abstracts (TAPTA)

April 2011: extension of CLIR to cover the Chinese, Korean, Russian and Portuguese languages

August 2011: release of PCT corpus: COPPA

MT at WIPO: an overview

November 2011: Integration of KIPO’s machine translation system in PATENTSCOPE for the KO-EN language pairs (until December 2012)

December 2011: Integration of Microsoft Translate into PATENTSCOPE

January 2012: extension of CLIR to cover the Dutch, Italian, and Swedish languages

April 2012: PATENTSCOPE CLIR functionality integrated into Minesoft’s PATBASE through a web service

MT at WIPO: an overview

June 2012: MT knowledge transfer to the UN and ITU

October 2012: UN, ITU and WIPO Marks systems in production

November 2012: Extension of TAPTA to cover Japanese and German

February 2013: Evaluation results: TAPTA better than Microsoft and Google (titles and abstracts, in all language pairs; similar results for the UN)

June 2013: Outsourcing contract using TAPTA for the EN=>FR language pair

2. CLIR (Cross-Lingual Information Retrieval)

WIPO’s Cross-lingual search: principle

► Free tool available at http://patentscope.wipo.int/search/clir/clir.jsp?interfaceLanguage=en

► Enter a search query in any of EN, DE, ES, FR, JP, RU, ZH, PT, IT, DU or SE and it will be expanded into the other languages (keyword translation)

► Automatic or supervised mode

► Balance between precision and recall set by the user

► Disambiguation by technical domains and by selection of appropriate synonyms

► Built from bilingual dictionaries extracted statistically from patent corpora without supervision
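The balance between precision and recall can be pictured as a score threshold applied to candidate translations. The sketch below is only an illustration under stated assumptions: the dictionary layout, the `expand_term` helper and all scores are hypothetical (the French terms are taken from the enriched query shown in this section), and nothing here describes the actual PATENTSCOPE implementation.

```python
# Hypothetical toy dictionary: (source, target) -> term -> [(translation, score)].
# The scores are invented for illustration; only the French terms come from
# the enriched query shown in the slides.
TOY_DICTIONARY = {
    ("en", "fr"): {
        "hearing aids": [
            ("audioprothèses", 0.55),
            ("appareils de correction auditive", 0.30),
            ("production d'appareils auditifs", 0.15),
        ],
    },
}


def expand_term(term, source, target, min_score=0.25, dictionary=TOY_DICTIONARY):
    """Return candidate translations whose score is at least min_score.

    A high min_score keeps only the most likely translations (favours
    precision); a low min_score keeps more variants (favours recall).
    """
    candidates = dictionary.get((source, target), {}).get(term, [])
    return [translation for translation, score in candidates if score >= min_score]


precise = expand_term("hearing aids", "en", "fr", min_score=0.5)
broad = expand_term("hearing aids", "en", "fr", min_score=0.1)
print(precise)
print(broad)
```

In a supervised mode, the user would review the `broad` list and deselect unwanted synonyms instead of relying on the threshold alone.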

Interface: Cross-lingual (CLIR) - Automatic

(EN_TI:("hearing aids" OR "hearing prosthetic"~21 OR "auditory aids"~21 OR "auditory prosthetic"~21) OR EN_AB:("hearing aids" OR "hearing prosthetic"~21 OR "auditory aids"~21 OR "auditory prosthetic"~21)) OR (DE_TI:("Hörgeräte" OR "Hörhilfegeräten") OR DE_AB:("Hörgeräte" OR "Hörhilfegeräten")) OR (ES_TI:("audífonos") OR ES_AB:("audífonos")) OR (FR_TI:("audioprothèses" OR "appareils de correction auditive" OR "production d'appareils auditifs") OR FR_AB:("audioprothèses" OR "appareils de correction auditive" OR "production d'appareils auditifs")) OR (JA_TI:(" 穴形補聴器 ") OR JA_AB:(" 穴形補聴器 ")) OR (KO_TI:(" 보청 ") OR KO_AB:(" 보청 ")) OR (PT_TI:("audiofone" OR "auxìlio de audição") OR PT_AB:("audiofone" OR "auxìlio de audição")) OR (RU_TI:("слуха протезно"~22 OR "прослушивания протезно"~22 OR "слуха спидом"~22 OR "слуха наведения"~22 OR "прослушивания спидом"~22 OR "прослушивания наведения"~22 OR "слухоулучшающих протезно"~22 OR "слуховой протезно"~22 OR "слухоулучшающих спидом"~22) OR RU_AB:("слуха протезно"~22 OR "прослушивания протезно"~22 OR "слуха спидом"~22 OR "слуха наведения"~22 OR "прослушивания спидом"~22 OR "прослушивания наведения"~22 OR "слухоулучшающих протезно"~22 OR "слуховой протезно"~22 OR "слухоулучшающих спидом"~22)) OR (ZH_TI:(" 助听器 ") OR ZH_AB:(" 助听器 "))

CLIR: automatically enriched query
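The structure of the enriched query (one group per language, each searching the title and abstract fields with the same OR-ed terms) can be reproduced with a few lines of code. This is a minimal sketch: the function names are hypothetical, and the proximity suffixes such as `~21` that appear in the query above are left out.

```python
def field_clause(lang, field, terms):
    """Build one field clause, e.g. DE_TI:("Hörgeräte" OR "Hörhilfegeräten")."""
    quoted = " OR ".join(f'"{term}"' for term in terms)
    return f"{lang}_{field}:({quoted})"


def build_clir_query(terms_by_language, fields=("TI", "AB")):
    """OR together one parenthesised group per language.

    Each group searches every field in `fields` with the same term list.
    """
    groups = []
    for lang, terms in terms_by_language.items():
        clauses = " OR ".join(field_clause(lang, field, terms) for field in fields)
        groups.append(f"({clauses})")
    return " OR ".join(groups)


query = build_clir_query({
    "DE": ["Hörgeräte", "Hörhilfegeräten"],
    "ES": ["audífonos"],
})
print(query)
```

With these two languages the output matches the corresponding German and Spanish fragment of the query above.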

Why use PATENTSCOPE CLIR?

A) Search full-text collections simultaneously in many foreign languages without knowing them (not English-centric)

B) Significantly increase the number of relevant results without significantly increasing the number of irrelevant results

3356 results in English titles or abstracts for hearing AND aids

3825 results obtained with CLIR searching in titles or abstracts in all languages

C) Have confidence in your searches:

No black box: users have access to the CLIR-generated Boolean queries (albeit complex) and have full control over them

D) Have a responsive system even for complex queries

The query on the previous slide executes in less than half a second in PATENTSCOPE
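As a quick check of the figures quoted under B), the gain from CLIR can be worked out directly from the two result counts. The variable names are illustrative, and the counts alone say nothing about how many of the additional results are relevant.

```python
english_only = 3356  # results for hearing AND aids in English titles or abstracts
with_clir = 3825     # results with CLIR across titles or abstracts in all languages

extra = with_clir - english_only
relative_gain = extra / english_only
print(f"{extra} additional results ({relative_gain:.1%} more)")
```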

What next?

Improve terminology coverage of already supported languages

Add other languages (Arabic)?

Condition to add a language:

Having more than 200,000 (ideally 500,000) titles, and if possible abstracts, available in the language with associated high-quality English translations

3. Integrated third-party MT engines

9 interface languages: Deutsch | English | Español | Français | 日本語 | 한국어 | Português | Русский | 中文

Integrated 3rd party MT: principles

► Use free MT services available on the internet (so far Google Translate and Microsoft Translate)

► Translate from the source language(s) into the language set by the user in the graphical interface

► Translate result lists, descriptions and claims only when requested by the user

► 65 languages supported using Google Translate!

► Quality of Google Translate improved for patent texts thanks to EPO sharing patent corpora with Google

Search Results – machine translate

Description – machine translate

4. Development of in-house MT engines tuned for specific tasks

In-house MT engines

MT systems building expertise developed in-house since 2009

Corpora approach: started using PCT corpus of titles and abstracts

Uses open source Statistical Machine Translation: Moses (WIPO is a committer with a specific branch)

First system developed: Translation Assistant for Patent Titles and Abstracts (TAPTA: publicly available at https://www3.wipo.int/patentscope/translate)

Same system (trained on different corpora) developed for the United Nations, for ITU and for translation of Madrid Trademarks goods and services

TAPTA

Hovering the mouse on the left highlights corresponding segment on the right (and vice-versa)

How well does it work?

TAPTA better than Google and Microsoft for abstracts

English->French (15 abstracts*): TAPTA BLEU 46.9 / Google 45.9 / Google-EPO 45.8 / Microsoft 36.7

German->English (11 titles & abstracts*): TAPTA BLEU 38.3 / Google 37.8 / Microsoft 26.8

Human evaluation, adequacy/fluency: TAPTA 79%, Google 65%, Microsoft 67%

English->Japanese (1000 segments, title & abstract*): TAPTA BLEU 25.4 / Google BLEU 22.3

English->Chinese (1000 segments, title & abstract*): TAPTA BLEU 22 / Google BLEU 17.5

(*) from recent patent applications (published in March 2013), compared to one reference
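For readers unfamiliar with BLEU, the sketch below shows how such a score can be computed against a single reference per segment: clipped n-gram precisions up to 4-grams, combined by geometric mean and multiplied by a brevity penalty. It is a simplified, unsmoothed version working on whitespace tokens, written for illustration only; it is not the evaluation tool used for the figures above, and its scores would not be directly comparable with them.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per segment.

    Inputs are lists of token lists. No smoothing is applied, so the score
    is 0 when some n-gram order has no match at all.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped counts: a hypothesis n-gram is credited at most as
            # often as it occurs in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


reference = "a hearing aid with a microphone".split()
hypothesis = "a hearing aid with microphone".split()
print(corpus_bleu([hypothesis], [reference]))
```

Because only one reference is used, valid alternative translations are penalised, which is one reason to complement BLEU with human adequacy/fluency judgements.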

Also in United Nations

Aims to assist UN translators translating official UN documents between EN and AR, ES, FR, RU, ZH (both directions)

BLEU scores

Language pair   TAPTA   Google   Bing
ar-en           55.25   n/a[1]   51.17
en-ar           44.10   33.74    28.94
en-es           61.81   53.39    46.86
en-fr           51.23   45.58    42.19
en-ru           50.85   39.67    38.96
en-zh           43.17   34.16    32.77
es-en           60.32   52.54    49.18
fr-en           53.36   46.46    43.39
ru-en           58.56   47.71    47.09
zh-en           42.31   36.55    30.60
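To read the table more easily, the margin of TAPTA over each general-purpose engine can be computed per language pair. The sketch below simply subtracts the scores listed above; the `SCORES` layout and the `margins` helper are illustrative names, and `None` stands for the unavailable Google ar-en score.

```python
# BLEU scores copied from the table above: (TAPTA, Google, Bing); None = n/a.
SCORES = {
    "ar-en": (55.25, None, 51.17),
    "en-ar": (44.10, 33.74, 28.94),
    "en-es": (61.81, 53.39, 46.86),
    "en-fr": (51.23, 45.58, 42.19),
    "en-ru": (50.85, 39.67, 38.96),
    "en-zh": (43.17, 34.16, 32.77),
    "es-en": (60.32, 52.54, 49.18),
    "fr-en": (53.36, 46.46, 43.39),
    "ru-en": (58.56, 47.71, 47.09),
    "zh-en": (42.31, 36.55, 30.60),
}


def margins(scores):
    """Return {pair: (TAPTA - Google, TAPTA - Bing)}, None where unavailable."""
    result = {}
    for pair, (tapta, google, bing) in scores.items():
        result[pair] = tuple(
            None if other is None else round(tapta - other, 2)
            for other in (google, bing)
        )
    return result


for pair, (vs_google, vs_bing) in margins(SCORES).items():
    print(pair, vs_google, vs_bing)
```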

Findings

Customized MT engines built on narrow language domains outperform state-of-the-art general-purpose MT engines

TAPTA scores higher than Google Translate in automatic evaluation (BLEU) on patent titles and abstracts

Size of corpora matters, as does the quality of sentence-pair alignment

Building customized SMT engines is sustainable and does not require large human, IT and financial resources
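The point about sentence-pair alignment quality can be illustrated with a simple length-based filter, a common heuristic for discarding badly aligned pairs before training an SMT engine. This sketch is an assumption-laden illustration, not WIPO's documented pipeline: the function names and the thresholds (`max_len=80`, `max_ratio=3.0`) are hypothetical, and whitespace tokenisation would not suit unsegmented languages such as Chinese or Japanese.

```python
def keep_pair(source, target, min_len=1, max_len=80, max_ratio=3.0):
    """Return True if a sentence pair looks plausibly aligned by length.

    Lengths are counted in whitespace-separated tokens. Pairs with an empty
    side, a side outside [min_len, max_len], or a length ratio above
    max_ratio are rejected.
    """
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return False
    if not (min_len <= src_len <= max_len and min_len <= tgt_len <= max_len):
        return False
    return max(src_len, tgt_len) / min(src_len, tgt_len) <= max_ratio


def filter_corpus(pairs, **thresholds):
    """Keep only the (source, target) pairs accepted by keep_pair."""
    return [pair for pair in pairs if keep_pair(*pair, **thresholds)]


corpus = [
    ("Hearing aid", "Prothèse auditive"),
    ("Hearing aid", ""),
    ("Hearing aid", "un deux trois quatre cinq six sept huit neuf dix"),
]
print(filter_corpus(corpus))
```

Only the first pair survives here: the second has an empty target and the third has a length ratio of 5, above the assumed limit of 3.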

Bibliography

TAPTA: A user-driven translation system for patent documents based on domain-aware Statistical Machine Translation, B. Pouliquen, C. Mazenc, A. Iorio, in proceedings of the European Association for Machine Translation conference, May 2011, Leuven, Belgium

COPPA, CLIR and TAPTA: three tools to assist in overcoming the Patent language barrier at WIPO, B. Pouliquen, C. Mazenc, in proceedings of Machine Translation Summit 2011, September 2011, Xiamen, China

Statistical Machine Translation prototype using UN parallel documents, B. Pouliquen, C. Mazenc, C. Elizalde, J. Garcia-Verdugo, in proceedings of the 16th EAMT conference, 28-30 May 2012, Trento, Italy

(forthcoming) Large-scale multiple language translation accelerator at the United Nations, B. Pouliquen, C. Elizalde, M. Junczys-Dowmunt, C. Mazenc, J. Garcia-Verdugo, in proceedings of Machine Translation Summit 2013, Nice, France

5. Strategy

WIPO’s MT strategy

Make best use of state-of-the-art technologies available in open source and further promote their development

Adapt these technologies to the patent domain (using patent corpora, patent classification,…) for practical use cases

Develop patent MT systems and make them available to the largest number of users to bridge the language barrier (notably in patent searching)

Cooperate with interested offices by sharing experience, corpora and software solutions

Adopt a barrier-free dissemination of patent corpora when possible to foster research in MT for patent texts

Investigate Cloud technologies to be able to ramp up to industrial internet solutions

TAPTA: Extend coverage (languages, claims, descriptions)

Questions?