Post on 11-Jun-2015
Towards OpenLogos Hybrid Translation Anabela Barreiro
INESC-ID anabela.barreiro@inesc-id.pt
Research goals
– OpenLogos – 1st hybrid open source machine translation solution
– Hybridization of the OpenLogos system consists in embedding linguistic
knowledge into statistical machine translation (SMT)
The timing is just right…
– Recognition by SMT researchers and developers of the need to integrate
linguistic knowledge in machine translation (MT) systems
– Benefit from cloud computing, big data and advanced alignment techniques,
which contribute to an easier and faster development of new language pairs
– Use crowd sourcing support to increase MT quality
Introduction with Contextual Information
The ideal platform for hybrid translation
– Logos legacy (one of the first RBMT systems - 1970)
– Logos Corporation – one of the longest-running commercial MT companies in the
world (in business for over 30 years)
– The Logos MT product put its emphasis on semantic understanding
– The Logos approach was through linguistic analysis of English to render it in a
form that was “understood” by the computing system
– To a certain extent, the Logos approach is similar in spirit to the SMT approach,
and complements SMT by providing answers that help overcome statistical
weaknesses
The open source initiative
– OpenLogos is publicly available as open source software
– It has some enthusiastic advocates and fervent supporters in different parts of the
world who believe that:
• OpenLogos will be used as the rule-based component of a new linguistically
enhanced hybrid translation system
• The open source components of OpenLogos will help the NLP/CL research
community make scientific advances
Background on OpenLogos MT
System pipeline architecture
SAL representation language
Classic problems with rule-driven systems
How SAL benefits translation
Advantages of the OpenLogos architecture
Uniqueness of the OpenLogos MT system
Exploiting OpenLogos resources for new applications
Availability of OpenLogos free resources
Presentation Outline
Open source copy of the Logos system (1970-2001) adapted by DFKI
– Developed in US, Germany, Italy
– 25-100 development staff for 30 years
– Over 80 million US dollars invested
8 language pairs: EN-GE, EN-FR, EN-ES, EN-IT, EN-PT,
GE-EN, GE-FR, GE-IT
Commercial product was considered high quality
Industrial strength MT used successfully in 12 countries
Users included: Ericsson of Sweden, the Canadian Secretary of State, SAP,
Siemens-Nixdorf, Oce Netherlands, and Union Fenosa
Background to OpenLogos
Multi-target System
– One source language analysis can generate any number of targets
Pipeline Architecture
Language-neutral Software
– All linguistic knowledge is in data files, stored in a relational database
Semantico-Syntactic Abstraction Language (SAL Representation)
– Taxonomy-ontology
– NL sentences entering the system are immediately converted into SAL sentences
– SAL is the driving force of the OpenLogos process
Semantic Processing
– Semantic Table (= SEMTAB) containing thousands of transformation rules
OpenLogos Characteristics
OpenLogos Pipeline Architecture
[Figure: pipeline diagram from Input to Output. Modules shown: Format, RES1, RES2, P1, P2, P3, P4, S, T1, T2, T3, T4, GEN, Format. Rulebases shown: SAL Rules, SEMTAB, Target Rules.]
• Highly Modular
• Incremental Processing
• Multi-Target System
• Bottom-up Analysis
• Deterministic Parse
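The modular, incremental, multi-target design listed on this slide can be sketched in a few lines. This is only an illustration, not OpenLogos code: the stage ordering (RES1, RES2, P1 to P4, then T1 to T4 and GEN per target) is an assumption read off the module names in the figure, and `make_stage`, `analyse` and `translate` are hypothetical names. Each placeholder stage simply records that it ran.

```python
from typing import Callable, Dict, List

# Hypothetical stage type: takes the current stream and returns a new one.
Stage = Callable[[List[str]], List[str]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only appends its own name to the stream."""
    def stage(stream: List[str]) -> List[str]:
        return stream + [name]
    return stage


# Assumed ordering of the source-side modules named in the figure.
SOURCE_STAGES = [make_stage(n) for n in ("RES1", "RES2", "P1", "P2", "P3", "P4")]
# Assumed ordering of the target-side modules named in the figure.
TARGET_STAGE_NAMES = ("T1", "T2", "T3", "T4", "GEN")


def analyse(sentence: str) -> List[str]:
    """Run the source-side stages one after another (incremental processing)."""
    stream = [sentence]
    for stage in SOURCE_STAGES:
        stream = stage(stream)
    return stream


def translate(sentence: str, targets: List[str]) -> Dict[str, List[str]]:
    """Analyse the source once, then run the target stages for each language."""
    analysis = analyse(sentence)  # shared by all targets
    results: Dict[str, List[str]] = {}
    for lang in targets:
        stream = list(analysis)
        for name in TARGET_STAGE_NAMES:
            stream = make_stage(f"{name}[{lang}]")(stream)
        results[lang] = stream
    return results
```

Calling `translate("some sentence", ["FR", "PT"])` runs the source stages once and reuses that analysis for both targets, which is the point of the "Multi-Target System" bullet.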
Clause Segmentation
Homograph Resolution
– ways of cooking lentils: cooking = V
– types of [cooking utensils]: cooking = ADJ
Deterministic parsing requires that all ambiguous PoS be resolved (98% precision)
[Figure: entry into the pipeline. Modules shown: Format, RES1, RES2. Rulebases shown: SAL Rules, SEMTAB.]
Incremental Source Analysis - 1
[Figure: source analysis (S) in four stages, Parse1 to Parse4, driven by SAL Rules and Semtab. Stage labels, in the order shown:]
– Simple NP; semantic resolution
– NP Prep NP; relative clauses; semantic resolution; verb semantics
– Complex NP; simple clauses; semantic resolution
– Order in complex sentences; semantic resolution
Incremental Source Analysis - 2
E.g.: a book on the presidency (on = about; concerning)
≠ a book on the table (on = over)
SAL - Semantico-syntactic Abstraction Language
SAL Taxonomy: 3 levels organized hierarchically
– Supersets / Sets / Subsets
Semantico-Syntactic continuum from NL word to Word Class
– Literal word: airport
– Head morph: port
– SAL Subset: Agfunc (agentive functional location)
– SAL Set: func (functional location)
– SAL Superset: PL (place)
– Word Class: N
Both Pipeline Input Stream and Rulebases are expressed in SAL
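The continuum from literal word to word class can be pictured as a record holding one value per level. The sketch below uses only the "airport" values from this slide; the class name `SalEntry`, its field names and the `matches` helper are hypothetical and say nothing about how OpenLogos actually stores its dictionary.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SalEntry:
    """One word described at every level of the semantico-syntactic continuum."""
    literal: str      # literal word
    head_morph: str   # head morph
    subset: str       # SAL subset
    set_: str         # SAL set
    superset: str     # SAL superset
    word_class: str   # word class

    def levels(self) -> List[str]:
        """Return the representations from most specific to most abstract."""
        return [self.literal, self.head_morph, self.subset,
                self.set_, self.superset, self.word_class]

    def matches(self, code: str) -> bool:
        """True if the code names this word at any level of abstraction."""
        return code in self.levels()


# Values taken from the slide.
airport = SalEntry("airport", "port", "Agfunc", "func", "PL", "N")
```

A rule written against the superset `PL` and a rule written against the literal `airport` would both match this entry, which is one way to read "continuum".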
SAL Representation Language
SAL Noun Supersets
E.g: two pieces of cake
NP parse must have:
- Plural morphology of pieces
- Semantics of cake
Developed:
- inductively
- by trial and error
- over a period of years
- by the development team
Abstract Noun Taxonomy
Abstract Noun Superset
– Non-verbal Abstract Set
  • Non-verbal Subsets, e.g. Classifications
– Verbal Abstract Set
  • Verbal Subsets, e.g. Methods / Procedures
Is the word cooking a verb or an adjective?
ways of cooking lentils
types of cooking utensils
ways = N(AB/method): parser verb bias
types = N(AB/class): parser non-verb bias
Use of SAL Codes to Resolve Homographs
SAL contributes to the resolution of the homograph
The SAL code N(AB/method) in the rule matches a similar code in the SAL input stream.
The effect of such a match is to resolve cooking as a verb.
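A toy version of this resolution step, assuming a two-word lexicon and a bias table built only from the two examples on the slide. The function name `resolve_homograph` and the left-to-right search strategy are illustrative assumptions, not the RES rulebase itself.

```python
# SAL codes for the two head nouns used on the slide.
SAL_CODES = {"ways": "N(AB/method)", "types": "N(AB/class)"}

# Bias attached to each code on the slide: method nouns favour a verb reading,
# class nouns favour a non-verb (here: adjective) reading.
BIAS = {"N(AB/method)": "V", "N(AB/class)": "ADJ"}


def resolve_homograph(phrase: str, homograph: str = "cooking") -> str:
    """Resolve the homograph from the SAL code of the nearest coded noun to its left."""
    words = phrase.split()
    idx = words.index(homograph)
    for word in reversed(words[:idx]):
        code = SAL_CODES.get(word)
        if code in BIAS:
            return BIAS[code]
    return "UNRESOLVED"  # no coded noun found; a real system would try other rules
```

With this table, "ways of cooking lentils" resolves to `V` and "types of cooking utensils" to `ADJ`, mirroring the slide.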
Rules Have Five Components
SAL Pattern
– PARSE2 example: N(IN/data;u) Prep(“on”;u) N(u;u) (a book on the presidency)
Constraints
– Match only if conditions are true or false
Source Actions
– RES Rulebase: Resolves syntactic ambiguity
– PARSE Rulebase: Creates parse tree
– SEMTAB Rules: Effects semantic disambiguation
Target Action (optional)
– Effects syntactic and/or semantic transfer
Comment Line
– PARSE2 example: NP(info) Prep(“on”) NP > N1 “about” N2
E.g., book on political satire > book about ....
What SAL Rules Look Like
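The five components can be laid out as a small record. This is a sketch under stated assumptions: the class `SalRule`, the tuple encoding of pattern elements, the reading of `u` as "unspecified" (encoded here as `None`), and the wording of the source action are all illustrative. Only the pattern, the target action and the comment text come from the PARSE2 example on the slide; the code `X` for the second noun is a placeholder.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# (word class, SAL code or literal); None stands for an unspecified slot.
Element = Tuple[str, Optional[str]]


@dataclass
class SalRule:
    pattern: List[Element]                        # 1. SAL pattern
    constraint: Callable[[List[Element]], bool]   # 2. constraints
    source_action: str                            # 3. source action
    target_action: Optional[str]                  # 4. target action (optional)
    comment: str                                  # 5. comment line

    def applies(self, stream: List[Element]) -> bool:
        """True when the stream starts with the pattern and the constraint holds."""
        n = len(self.pattern)
        if len(stream) < n:
            return False
        for (p_class, p_code), (s_class, s_code) in zip(self.pattern, stream):
            if p_class != s_class:
                return False
            if p_code is not None and p_code != s_code:
                return False
        return self.constraint(stream[:n])


book_on = SalRule(
    pattern=[("N", "IN/data"), ("Prep", "on"), ("N", None)],
    constraint=lambda elements: True,             # no extra condition in this sketch
    source_action="build a noun phrase node",     # illustrative wording
    target_action='N1 "about" N2',
    comment='NP(info) Prep("on") NP',
)
```

An input stream such as `[("N", "IN/data"), ("Prep", "on"), ("N", "X")]` satisfies `book_on.applies(...)`, while a stream whose first noun carries a different code (for example `PL`) does not.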
Complexity
– Logic saturation
– Rulebase grows too large
– Performance degradation
– Difficult maintainability
– System improvability stasis
Ambiguity
– Quality/accuracy of output – depends on effective disambiguation
– Effective disambiguation causes rulebase growth
Classic Dilemma of the Developer
– Reducing rulebase size to relieve complexity weakens disambiguation
– Increasing rulebase size to address ambiguities increases complexity
Classic Problem of RBMT
Complexity
– Rules and input stream are expressed as SAL patterns
– Homogeneous ‘apples-to-apples’ matching
– Rules are SAL patterns stored/organized in an indexed pattern dictionary
– SAL input stream serves as search argument to SAL rulebase
– No limit on rule size and no impact on performance
– Rules are self organizing
– Rulebase is easy to maintain
How OpenLogos Addresses Complexity and Ambiguity
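The idea that the SAL input stream serves as the search argument into an indexed pattern dictionary can be illustrated with a plain hash index. This is a minimal sketch, not the OpenLogos storage layout: the class `PatternIndex`, indexing on the first pattern element, and returning the longest match first are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


class PatternIndex:
    """Rules stored by pattern, looked up with the input stream as the key."""

    def __init__(self) -> None:
        self._by_first: Dict[str, List[Tuple[Tuple[str, ...], str]]] = defaultdict(list)

    def add(self, pattern: Sequence[str], rule_name: str) -> None:
        """File the rule under the first element of its pattern."""
        self._by_first[pattern[0]].append((tuple(pattern), rule_name))

    def lookup(self, stream: Sequence[str]) -> List[str]:
        """Return rules whose pattern is a prefix of the stream, longest first."""
        if not stream:
            return []
        candidates = self._by_first.get(stream[0], [])
        hits = [(p, name) for p, name in candidates
                if tuple(stream[:len(p)]) == p]
        hits.sort(key=lambda hit: len(hit[0]), reverse=True)
        return [name for _, name in hits]


index = PatternIndex()
index.add(["N", "Prep", "N"], "np_prep_np")
index.add(["N"], "simple_np")
index.add(["V", "N"], "verb_object")
```

Only rules filed under the first element of the stream are examined, so adding rules that start with other elements does not slow down this lookup. That is the sense in which rule matching stays efficient as the rulebase grows.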
How Rules Are Applied
Metaphor: biological neural net
– Vectors labeled V1-V6 = SAL input stream of the pipeline
– Cells in input vectors = SAL elements/words to which the NL input stream has been
converted
– In this network, R1 through P4 = hidden layers containing SAL rules
– R1 represents RES1, P1 represents Parse1 and so on.
– Each hidden layer contains between 2 and 4 thousand rules, organized by their SAL
pattern, as in a dictionary.
As the analysis progresses:
1- cells become fewer (abstract nature of the parse)
2- vectors become lighter (semantic disambiguation)
Chief similarity
– Efficient interaction between the SAL input stream and the rules of the
hidden layers
– Only those rules which should be looked at are accessed
– The developer does not need to develop metarules or discrimination
networks to achieve efficiency in rule matching
– Efficiency in rule matching is an automatic by-product of system design
Ambiguity
– Syntactic Homograph Resolution
– Scoping of adjectives, prepositions
– Polysemy
Resolution of Polysemy in OpenLogos
SAL Representation Language in interaction with SEMTAB
SEMTAB provides a transfer that overrides the default dictionary transfer
for the verb “raise”
NL String | SEMTAB Rule | Portuguese Transfer
raise a child | V(‘raise’) N(ANdes) | criar. . .
raise corn | V(‘raise’) N(MAedib) | cultivar. . .
raise the rent | V(‘raise’) N(MEabs) | aumentar. . .
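The override behaviour in the table can be sketched as a lookup keyed on the verb and the SAL code of its object, falling back to the default dictionary transfer when no rule matches. The three rule rows come from the slide; the default transfer `levantar`, the noun-to-code table and the function name `transfer_verb` are assumptions added for illustration.

```python
# SEMTAB-style rules from the slide: (verb, SAL code of object) -> Portuguese verb.
SEMTAB = {
    ("raise", "ANdes"): "criar",
    ("raise", "MAedib"): "cultivar",
    ("raise", "MEabs"): "aumentar",
}

# SAL codes of the object nouns, as implied by the rows of the table.
NOUN_CODES = {"child": "ANdes", "corn": "MAedib", "rent": "MEabs"}

# Assumed default dictionary transfer; the slide does not state it.
DEFAULT_TRANSFER = {"raise": "levantar"}


def transfer_verb(verb: str, obj: str) -> str:
    """Pick the context-specific transfer if a rule matches, else the default."""
    code = NOUN_CODES.get(obj)
    return SEMTAB.get((verb, code), DEFAULT_TRANSFER.get(verb, verb))
```

Because the rules are keyed on SAL codes rather than on individual nouns, any other noun coded `MEabs` would receive `aumentar` without a new rule.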
Deep Structure Rules of SEMTAB
A single deep-structure rule matches multiple surface-structures
and produces correct target transfers
he raised the rent | ele aumentou a renda | V+Object
the raising of the rent | o aumento da renda | Gerund
the rent, raised by … | a renda, aumentada por… | Part. ADJ
a rent raise | um aumento de renda | Noun
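One way to picture a single deep-structure rule covering several surface structures is to reduce each surface string to the same (predicate, object code) key before lookup. The sketch below is a deliberately crude approximation: the token-based normalizer `deep_key`, the form list for "raise", and the use of the shared stem `aument` as the rule output are all assumptions, chosen only so that the four surface strings from the slide reach the same rule.

```python
from typing import Optional, Tuple

RAISE_FORMS = {"raise", "raised", "raising"}   # surface forms of the predicate
NOUN_CODES = {"rent": "MEabs"}                 # SAL code from the polysemy slide

# One deep-structure rule; "aument" is the stem shared by aumentou, aumento, aumentada.
DEEP_RULES = {("raise", "MEabs"): "aument"}


def deep_key(phrase: str) -> Optional[Tuple[str, str]]:
    """Reduce a surface string to (predicate, object code), ignoring word order."""
    tokens = [t.strip(",.…") for t in phrase.lower().split()]
    has_raise = any(t in RAISE_FORMS for t in tokens)
    codes = [NOUN_CODES[t] for t in tokens if t in NOUN_CODES]
    if has_raise and codes:
        return ("raise", codes[0])
    return None


def target_stem(phrase: str) -> Optional[str]:
    """Return the Portuguese stem selected by the deep-structure rule, if any."""
    return DEEP_RULES.get(deep_key(phrase))
```

All four surface structures in the table map to the same key and therefore to the same rule; choosing the verb, noun or participle form of the target is left to later stages and is not modelled here.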
How SAL Benefits Translation
The situation was alluded to by my friend in his letter
Mon ami a fait allusion à la situation dans sa lettre
The situation was alluded to in their letter
On a fait allusion à la situation dans leur lettre
Examples showing voice transformations
EN passive voice >>> FR active voice
Voice transformations are possible due to:
• incremental pipeline approach
• strong semantic sensitivity
Creation of systems involving small or neglected/endangered languages
– not targeted by commercial programs
– to fulfil the goals of administrations and NGOs dealing with these
languages, contributing to their promotion and/or revival
Freely available
– any user can access the technology
Customizable - institutions or businesses adopting an open-source MT system can
customize it to their needs in many ways
– developing new linguistic data (vocabularies, rules, corpora)
– integrating system/data with other packages
– etc.
Advantages of OpenLogos Machine Translation Architecture
Extensible dictionaries with underlying semantic foundation
Analyses whole source sentences, considering:
– Morphology
– Meaning (semantics)
– Grammatical structure and function
Semantico-Syntactic Abstraction Language (SAL)
– the parser is able to achieve better results than syntactic analysis alone
would allow.
Parsing is only source language specific; generation is target language
specific
Originally a transfer approach, evolved to the present system (which has
interlingual features inherent to the system)
OpenLogos Uniqueness
OpenLogos' comprehensive analysis makes it possible to construct a complete and
idiomatically correct translation in the target language
OpenLogos is suitable for research and academic use
– make OpenLogos the standard MT platform for universities, education and
other governmental institutions
– bring new life into a dormant technology (Phoenix rising metaphor)
OpenLogos linguistic data representation can be established as the
foundation
– freely available for private and commercial use
– there is still need for the provision of linguistic and technical services
and/or customer support on a fee basis
– packaging OpenLogos with the top five Linux distributions will generate a
constant revenue stream
OpenLogos is an ideal platform for a hybrid MT solution
SPIDER
– System for Paraphrasing In Document Editing and Revision.
– Based on NooJ’s technology (http://www.nooj4nlp.net/)
– Publicly available at: http://www.linguateca.pt/ReEscreve/
– Designed to help with writing optimization, but its applicability extends to MT
pre-editing.
1st version – ReEscreve (for Portuguese) and ReWriter (for English)
2nd version – eSPERTo (Portuguese: the smart/clever one; expert)
Designed for integration in a cyber school project within the scope of an
educational program to teach students how to improve their writing skills in
the Portuguese language
EXPERT (prototype) - to assist writing of domain-specific texts
Initially, OpenLogos EN-PT dictionary data were adapted and enhanced with new properties (derivational, etc.) to create a new resource:
Port4NooJ (http://www.linguateca.pt/Repositorio/Port4NooJ/). ReEscreve uses Port4NooJ.
Contribution of OpenLogos Resources for New NLP Applications
ParaMT
– Bilingual/multilingual paraphraser (translator prototype)
– Uses similar methodology to that employed by SPIDER
– Uses bilingual data
– Directly applicable to MT
Corpógrafo
– Multilingual corpora management tool
– Available at: http://www.linguateca.pt/corpografo/
– Authoring aid (word processing applications)
– Language composition tool
– Text production and style editor
– Empirical testbed for linguistic quality assurance
– Text (pre-)editing (machine translation)
– “Revision memory” tool (≈ “translation memory”)
– Applicable to general and technical language
When integrating terminologies, it helps writing in technical domains
(e.g. student texts - ReWriter or legal texts - EXPERT)
Uses of SPIDER
ReEscreve: Suggestions for Text Rewriting
Paraphrases of SVC presented by ReEscreve’s
paraphrasing system
ReEscreve: a Rewritten Text
Text rewritten based on the user’s preferences
Users can suggest new expressions!
Suggestions for Text ReWriting
Suggestions for general language linguistic phenomena
– Compound adverbs > single adverbs
– Support verb constructions > single verbs
– Relatives > participial adjectives
Selection of paraphrasing grammars for specific linguistic phenomena
Users can select among general and technical dictionaries (more than one selection allowed),
grammars for specific linguistic transformations (one, several or all grammars can be selected).
The interface provides sample texts for testing.
Sample LEGAL text
Informative details about the linguistic resources selected
Identification of legal terms in the text
Suggestions for the term “breach of law”
Users can select one term from the list of suggestions or provide a new
suggestion
Selection of a Domain Dictionary
Suggestions provided and user’s capability to add new rewriting options
Text rewritten
• In red, the expressions in the source text
• In green, suggestions provided by SPIDER and selected by the user
The user can suggest new words or expressions (synonyms or paraphrases)
It is possible to go back and change the user option as many times as necessary
Recognition of Portuguese SVC and translation into English verbs
ParaMT: a Paraphraser Applicable to MT
[Figure: machine translation example. Labels shown: PT support verb construction > EN verbs; $EN]
Selected Publications on Paraphrasing Applications
Anabela Barreiro. "SPIDER: a System for Paraphrasing In Document Editing and Revision -
Applicability in Machine Translation Pre-Editing". Computational Linguistics and
Intelligent Text Processing. Proceedings of the 12th International Conference 6609 (2011),
pp. 365-376. Springer. ISSN: 0302-9743. e-ISSN: 1611-3349. DOI: 10.1007/978-3-642-19400-9.
Part II, Lecture Notes in Computer Science
Anabela Barreiro. "ParaMT: a Paraphraser for Machine Translation". In António Teixeira, Vera
Lúcia Strube de Lima, Luís Caldas de Oliveira & Paulo Quaresma (eds.), Computational
Processing of the Portuguese Language, 8th International Conference, Proceedings
(PROPOR 2008) Vol. 5190, (Aveiro, Portugal, 8-10 September 2008), Springer Verlag.
Lecture Notes in Computer Science, pp. 202-211.
Anabela Barreiro & Luís Miguel Cabral. "ReEscreve: a translator-friendly multi-purpose
paraphrasing software tool". In Marie-Josée Goulet, Christiane Melançon, Alain Désilets &
Elliott Macklovitch (eds.), Proceedings of the Workshop Beyond Translation Memories: New
Tools for Translators, The Twelfth Machine Translation Summit (Château Laurier, Ottawa,
Ontario, Canada, 29 August 2009), pp. 1-8.
Anusaaraka group at LTRC, IIIT-Hyderabad
– Integrating OpenLogos in their English to Hindi Language accessor
– An OpenLogos-based English-Hindi MT prototype is already functional,
but needs refinement before release
Chaudhury, S.; Rao, A.; Sharma, D. M. (2010). "Anusaaraka: An Expert System based
Machine Translation System". In Proceedings of 2010 IEEE International Conference on
Natural Language Processing and Knowledge Engineering (IEEE NLP-KE2010), Beijing,
China, Aug 21-23, 2010.
Kalinga Institute of Industrial Technology, KIIT
– Setting up a research lab with MT based on OpenLogos technology
OpenLogos for Indian Languages
Department of Political, Social and Communication Sciences,
University of Salerno
– PhD dissertation where the OpenLogos English-Italian SEMTAB rules
methodology was applied, supported with the NooJ NLP environment to
represent the theoretical and methodological principles of the Lexicon-Grammar
Theory
Monti, Johanna (2013). Multi-word unit processing in Machine Translation. Developing and
using linguistic resources for multi-word unit processing in Machine Translation
Main Southern African universities
– Initial efforts to introduce OpenLogos as an MT platform for translation
between English and the African languages (scarce resources, lack of
parallel corpora, etc.) in an initiative similar to the one for Indian
languages
Other Efforts with OpenLogos
The Language Technology Lab of DFKI has adapted OpenLogos from the
commercial Logos System
Also at Sourceforge under a GPL license
http://openlogos-mt.sourceforge.net/
OpenLogos employs only open source components:
– Use of open source development tools and compilers, such as GCC
– Replacement of non-open code and libraries
– Use of open source databases instead of a commercial database. All
language specific resources have been converted to PostgreSQL
– Use of open standards instead of vendor specific protocols
– As a proof of concept for the software migration, Linux is used as target
platform for the first open source release of Logos
OpenLogos Resources at DFKI
Core code libraries of the server side system and basic executables to start
and run the system (APITest, logos_batch)
Resources, such as analysis (RES) and transfer (TRAN) grammars for
source and target languages, and a multi-language dictionary database
Tools: LogosTermBuilder, User administration (LogosAdmin), Command
line tools (APITest, openlogos), and multi-user GUI for initiating and
inspecting translation jobs and results (LogosTransCenter)
OpenLogos Components
DFKI hosts an open OpenLogos mailing list dedicated to discussion
and exchange of information concerning OpenLogos developments and
problems at:
http://www.dfki.de/mailman/listinfo/openlogos-list
LinkedIn Discussion Group on OpenLogos Machine Translation
OpenLogos Facebook page
DFKI User Assistance with OpenLogos
Selected Publications
A few publications and technical papers are available that describe:
– the SAL representation language
– the system architecture and workflow
Anabela Barreiro, Bernard Scott, Walter Kasper and Bernd Kiefer. OpenLogos Rule-Based
Machine Translation: Philosophy, Model, Resources, and Customization. In Machine
Translation, volume 25 number 2, Pages 107-126, Springer, Heidelberg, 2011. ISSN: 0922-6567.
DOI: 10.1007/s10590-011-9091-z
Bernard Scott and Anabela Barreiro. OpenLogos MT and the SAL Representation Language.
In Proceedings of the First International Workshop on Free/Open-Source Rule-Based
Machine Translation. Edited by Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Francis
M. Tyers. Alicante, Spain: Universidad de Alicante. Departamento de Lenguajes y Sistemas
Informáticos. 2–3 November 2009, pp. 19–26
Bernard Scott. The Logos Model: an Historical Perspective. In Machine Translation, vol. 18
(2003), pp. 1–72.