CS 188: Artificial Intelligence Spring 2007 Lecture 25: Machine Translation 4/24/2007 Srini Narayanan – ICSI and UC Berkeley


Announcements

Assignment 7 is up: grid-world and robot crawler. Due 5/3.

Extra office hours in the first two weeks of May:
• This week as usual: Thursday 11-1 PM
• 5/2 extra (Tuesday 11-1 PM)
• 5/3 usual 11-1 PM

Next assignment (not graded) will be a final exam review.


Reinforcement Learning

What you should know:
• MDPs
  • Basics, discounted reward
  • Policy evaluation
  • Bellman’s equation
  • Value iteration
  • Policy iteration
• Reinforcement Learning
  • Adaptive Dynamic Programming
  • TD learning (model-free)
  • Q-learning


Where we are

• Past: Basic Techniques of AI
  • Search, Representation, Uncertainty and Inference, Learning
• Next: Applications
  • MT, NLU (this week)
  • Neural Computation, Perception (next week)
• Today: Machine Translation (MT)
  • (Semi-)automatically translating text/speech from one language to another.


Translation is hard

• In a Bucharest hotel lobby:
  • The lift is being fixed for the next day. During that time we regret that you will be unbearable.
• In a Paris hotel elevator:
  • Please leave your values at the front desk.
• In a hotel in Athens:
  • Visitors are expected to complain at the office between the hours of 9 and 11 a.m. daily.
• In a Japanese hotel:
  • You are invited to take advantage of the chambermaid.
• In the lobby of a Moscow hotel across from a Russian Orthodox monastery:
  • You are welcome to visit the cemetery where famous Russian and Soviet composers, artists, and writers are buried daily except Thursday.


MT History

• 1946 (pre-AI): Booth and Weaver discuss MT at the Rockefeller Foundation in New York
• 1947-48: idea of dictionary-based direct translation
• 1949: Weaver memorandum popularized the idea
• 1952: all 18 MT researchers in the world meet at MIT
• 1954: IBM/Georgetown demo of Russian-English MT
• 1955-65: lots of labs take up MT


Early translation problems

English to Russian and back to English:
• Original: “The spirit is willing but the flesh is weak.”
• After the round trip: “The vodka is good but the meat is rotten.”


History of MT: Pessimism

1959/1960: Bar-Hillel, “Report on the state of MT in US and GB”
• Argued that fully automatic high-quality translation (FAHQT) is too hard (semantic ambiguity, etc.)
• Should work on semi-automatic instead of automatic translation
• His argument:
  “Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.”
• Only human knowledge lets us know that ‘playpens’ are bigger than boxes, but ‘writing pens’ are smaller
• His claim: we would have to encode all of human knowledge


History of MT

• Systran (Babelfish) has been used for 30 years
• 1970’s: European focus in MT; mainly ignored in US
• 1980’s: ideas of using AI techniques in MT (KBMT, CMU)
• 1990’s: commercial MT systems; Statistical MT (SMT); speech-to-speech translation
• 2000’s: SMT matures to be an exciting AI technology
  • Well funded, high-payoff, can make a real difference.


Levels of Transfer

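[Figure: Vauquois triangle. On the left side, the source text rises through morphological analysis (word structure), syntactic analysis (syntactic structure), semantic analysis (semantic structure), and semantic composition to an interlingua at the top. On the right side, the target text is produced by the mirror-image steps: semantic decomposition, semantic generation, syntactic generation, and morphological generation. Translation can cross from source to target at any level: direct (word level), syntactic transfer, semantic transfer, or through the interlingua.]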


What makes a good translation

Translators often talk about two factors we want to maximize:
• Faithfulness or fidelity
  • How close is the meaning of the translation to the meaning of the original?
  • (Even better: does the translation cause the reader to draw the same inferences as the original would have?)
• Fluency or naturalness
  • How natural the translation is, just considering its fluency in the target language


The Coding View

“One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’ ”

Warren Weaver (1955:18, quoting a letter he wrote in 1947)


MT System Components

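[Figure: noisy-channel model. A source produces an English sentence e with probability P(e); the channel turns e into a French sentence f with probability P(f|e); the decoder sees the observed f and outputs the best e.]

$$\hat{e} = \arg\max_e P(e \mid f) = \arg\max_e P(f \mid e)\, P(e)$$

• Language model: P(e)
• Translation model: P(f|e)

Finds an English translation which is both fluent and semantically faithful to the French source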
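
As a minimal sketch of this decision rule, the Python snippet below picks the English candidate e that maximizes P(f|e) * P(e). The candidate sentences and every probability are made-up illustrative numbers, not values from the lecture, and `best_translation` is a hypothetical helper name.

```python
import math

# Hypothetical candidates for one French input f, each with made-up scores:
# "tm" = P(f | e) from the translation model, "lm" = P(e) from the language model.
CANDIDATES = {
    "the house is blue": {"tm": 0.020, "lm": 0.0100},
    "the home is blue":  {"tm": 0.025, "lm": 0.0020},
    "blue is the house": {"tm": 0.020, "lm": 0.0001},
}

def best_translation(candidates):
    """Return argmax_e P(f|e) * P(e), computed in log space to avoid underflow."""
    def score(e):
        c = candidates[e]
        return math.log(c["tm"]) + math.log(c["lm"])
    return max(candidates, key=score)

print(best_translation(CANDIDATES))
```

In this toy setting the second candidate has the highest translation-model score, but the language model penalizes it enough that a more fluent candidate wins, which is the point of combining the two models.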


The Classic Language Model: Word N-Grams

Generative approach:
  w1 = START
  repeat until END is generated:
    produce word w2 according to a big table P(w2 | w1)
    w1 := w2

P(I saw water on the table) =
  P(I | START) * P(saw | I) * P(water | saw) * P(on | water) * P(the | on) * P(table | the) * P(END | table)

Probabilities can be learned from online English text.

[Figure: chain START → w1 → w2 → … → wn-1 → END]
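
The bigram model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-sentence corpus is made up to stand in for "online English text", the function names are hypothetical, and no smoothing is applied, so any unseen bigram drives the sentence probability to zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams over sentences padded with START/END markers."""
    bigrams, contexts = Counter(), Counter()
    for s in sentences:
        words = ["START"] + s.split() + ["END"]
        for w1, w2 in zip(words, words[1:]):
            bigrams[(w1, w2)] += 1
            contexts[w1] += 1
    return bigrams, contexts

def sentence_prob(sentence, bigrams, contexts):
    """P(sentence) as a product of P(w2 | w1); unseen bigrams give 0."""
    words = ["START"] + sentence.split() + ["END"]
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        if contexts[w1] == 0:
            return 0.0
        p *= bigrams[(w1, w2)] / contexts[w1]
    return p

# Tiny made-up corpus (an assumption for illustration only)
corpus = ["I saw water on the table", "I saw the table", "water on the floor"]
bigrams, contexts = train_bigram(corpus)
print(sentence_prob("I saw water on the table", bigrams, contexts))
```

A practical language model would add smoothing so that a fluent but previously unseen word pair does not zero out the whole sentence.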


Parallel Corpora

Parallel corpora (or bitexts)
• Collection of source-target translation pairs
• Main resource for learning a translation model
• Either naturally occurring (e.g. parliamentary proceedings, news translation services) or commissioned


Building a Translation Model

Steps in building a simple statistical translation model:
• Match up words in training sentence pairs (word alignment)
• Learn a lexicon from these alignments
• Learn larger phrases

[Figure: word alignment between the following sentence pair]
English: What is the anticipated cost of collecting fees under the new proposal?
French: En vertu de les nouvelles propositions, quel est le coût prévu de perception de les droits?
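
To make the "learn a lexicon from these alignments" step concrete, here is a small Python sketch that estimates P(f | e) by relative frequency over aligned word pairs. The alignment links below are hypothetical (the slide's alignment drawing did not survive), and `learn_lexicon` is an illustrative name rather than part of any toolkit.

```python
from collections import Counter, defaultdict

# Hypothetical word-aligned links (English word, French word), as they might
# come out of the alignment step; illustrative only, not read off the slide.
ALIGNED_LINKS = [
    ("the", "le"), ("the", "les"), ("the", "le"),
    ("cost", "coût"), ("new", "nouvelles"), ("fees", "droits"),
]

def learn_lexicon(links):
    """Estimate P(f | e) by relative frequency over aligned word pairs."""
    counts = defaultdict(Counter)
    for e, f in links:
        counts[e][f] += 1
    return {
        e: {f: n / sum(fs.values()) for f, n in fs.items()}
        for e, fs in counts.items()
    }

lexicon = learn_lexicon(ALIGNED_LINKS)
print(lexicon["the"])
```

The same counting idea extends to larger phrases once multi-word spans are extracted from the alignments.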


Centauri/Arcturan [Knight, 1997]

1a. ok-voon ororok sprok .
1b. at-voon bichat dat .

2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .

3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .

4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .

5a. wiwok farok izok stok .
5b. totat jjat quat cat .

6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .

7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .

8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .

9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .

10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .

11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .

12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
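
One way to start on this assignment by machine is pure co-occurrence: for each Centauri word, keep only the Arcturan words that appear in every sentence pair containing it. The sketch below is an assumption about how to mechanize the puzzle, not Knight's procedure; `candidates` is a hypothetical helper, and the corpus is copied from the slide with the final "." dropped.

```python
# Sentence pairs from the slide: (Centauri, Arcturan), final "." dropped.
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]

def candidates(word):
    """Arcturan words present in every pair whose Centauri side contains `word`."""
    sets = [set(arc.split()) for cen, arc in CORPUS if word in cen.split()]
    return set.intersection(*sets) if sets else set()

for w in "farok crrrok hihok yorok clok kantok ok-yurp".split():
    print(w, sorted(candidates(w)))
```

Words seen in several sentence pairs (such as farok and hihok) narrow down to a single candidate, while words seen only once keep the whole target sentence as candidates, so further cues are needed to finish the translation.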

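Process of elimination: once every other word in a sentence pair has been matched, the leftover Centauri word must pair with the leftover Arcturan word.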

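Cognate? Words that look alike in the two languages, such as ok-yurp and at-yurp, are likely translations of each other.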


Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }


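Zero fertility: a source word may translate to nothing. Here crrrok has no Arcturan counterpart (11a has six words but 11b only five), so the seven-word Centauri input yields only six Arcturan words.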
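
The word-ordering task can be approximated with a very crude language model: prefer orderings whose adjacent word pairs were already seen on the Arcturan side of the corpus. The sketch below is an assumption-laden illustration (counting seen bigrams instead of estimating probabilities, brute-force search over all 720 orderings); `seen_bigrams` and `best_orders` are hypothetical names.

```python
from itertools import permutations

# Arcturan side of the slide's corpus, final "." dropped.
ARCTURAN = [
    "at-voon bichat dat",
    "at-drubel at-voon pippat rrat dat",
    "totat dat arrat vat hilat",
    "at-voon krat pippat sat lat",
    "totat jjat quat cat",
    "wat dat krat quat cat",
    "wat jjat bichat wat dat vat eneat",
    "iat lat pippat rrat nnat",
    "totat nnat quat oloat at-yurp",
    "wat nnat gat mat bat hilat",
    "wat nnat arrat mat zanzanat",
    "wat nnat forat arrat vat gat",
]

def seen_bigrams(sentences):
    """Set of adjacent word pairs (with START/END markers) seen in the corpus."""
    seen = set()
    for s in sentences:
        words = ["START"] + s.split() + ["END"]
        seen.update(zip(words, words[1:]))
    return seen

def best_orders(bag, sentences):
    """All orderings of `bag` that contain the most previously seen bigrams."""
    seen = seen_bigrams(sentences)

    def score(order):
        words = ("START",) + order + ("END",)
        return sum(pair in seen for pair in zip(words, words[1:]))

    scored = {order: score(order) for order in permutations(bag)}
    top = max(scored.values())
    return sorted(o for o, s in scored.items() if s == top)

bag = ["jjat", "arrat", "mat", "bat", "oloat", "at-yurp"]
for order in best_orders(bag, ARCTURAN):
    print(" ".join(order))
```

With only twelve training sentences the bigram evidence does not single out one ordering, which illustrates why real systems combine the language model with the translation model's knowledge of source word order.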


It’s Really Spanish/English

Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

1a. Garcia and associates .
1b. Garcia y asociados .

2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .

3a. his associates are not strong .
3b. sus asociados no son fuertes .

4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .

5a. its clients are angry .
5b. sus clientes estan enfadados .

6a. the associates are also angry .
6b. los asociados tambien estan enfadados .

7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .

8a. the company has three groups .
8b. la empresa tiene tres grupos .

9a. its groups are in Europe .
9b. sus grupos estan en Europa .

10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .

11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .

12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .


Statistical Machine Translation

… la maison … la maison bleue … la fleur …

… the house … the blue house … the flower …

All word alignments equally likely

All P(french-word | english-word) equally likely


“la” and “the” are observed to co-occur frequently, so P(la | the) is increased.


“house” co-occurs with both “la” and “maison”, but P(maison | house) can be raised without limit, to 1.0, while P(la | house) is limited because of “the” (pigeonhole principle).


settling down after another iteration


Inherent hidden structure revealed by EM training!

For details, see:
• “A Statistical MT Tutorial Workbook” (Knight, 1999)
• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++
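
The EM procedure on the three toy sentence pairs can be sketched as below. This is a simplified IBM Model 1 style trainer written for illustration: it omits the NULL word, uses a fixed number of iterations, and is not the GIZA++ implementation; `train_model1` is a hypothetical name.

```python
from collections import defaultdict

# The three sentence pairs from the slides: (English, French)
PAIRS = [
    ("the house", "la maison"),
    ("the blue house", "la maison bleue"),
    ("the flower", "la fleur"),
]

def train_model1(pairs, iterations=10):
    """EM for a Model 1 style lexicon without NULL; returns t[(f, e)] = P(f | e)."""
    data = [(e.split(), f.split()) for e, f in pairs]
    f_vocab = {f for _, fs in data for f in fs}
    # Start with all P(french-word | english-word) equally likely
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of (f, e) links
        total = defaultdict(float)  # expected number of links from e
        for es, fs in data:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm  # E-step: posterior that f aligns to e
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

t = train_model1(PAIRS)
for e in ["the", "house", "blue", "flower"]:
    best = max(["la", "maison", "bleue", "fleur"], key=lambda f: t[(f, e)])
    print(e, "->", best, round(t[(best, e)], 3))
```

After the first iteration P(la | house) and P(maison | house) are still tied; from the second iteration on, the probability mass that "the" claims for "la" lets "maison" pull ahead for "house", mirroring the progression on the slides.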


Decoding

Now we have a phrase table:
• A huge list of translation phrases (e.g. 1M phrases)
• Each phrase has a probability P(f|e)

When we see a new input sentence:
• Grow a translation left to right
• Extend translation using known phrases
• Also multiply by language model score
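
The decoding loop described above can be sketched as a small beam search. Everything here is an assumption for illustration: the phrase table and bigram probabilities are made-up numbers, the search is monotone (source phrases are consumed strictly left to right, with any reordering hidden inside a phrase), and `decode` is a hypothetical function, not the interface of any real decoder.

```python
import math

# Hypothetical phrase table: French phrase -> [(English phrase, P(f | e))]
PHRASE_TABLE = {
    ("la",): [("the", 0.7)],
    ("maison",): [("house", 0.8), ("home", 0.2)],
    ("bleue",): [("blue", 0.9)],
    ("maison", "bleue"): [("blue house", 0.6)],
}
# Hypothetical bigram language model; unseen bigrams get a small floor value.
BIGRAM = {
    ("START", "the"): 0.5, ("the", "house"): 0.3, ("the", "blue"): 0.2,
    ("blue", "house"): 0.5, ("house", "blue"): 0.01,
    ("house", "END"): 0.5, ("blue", "END"): 0.1,
}
FLOOR = 1e-4

def lm_logprob(prev, words):
    """Log probability of `words` following `prev` under the bigram model."""
    logp = 0.0
    for w in words:
        logp += math.log(BIGRAM.get((prev, w), FLOOR))
        prev = w
    return logp

def decode(src, beam_size=5):
    """Grow translations left to right; return (log score, English words) or None."""
    # beams[i] holds hypotheses that cover the first i source words
    beams = {i: [] for i in range(len(src) + 1)}
    beams[0] = [(0.0, ("START",))]
    for i in range(len(src)):
        beams[i] = sorted(beams[i], reverse=True)[:beam_size]  # prune weak hypotheses
        for score, hist in beams[i]:
            for j in range(i + 1, len(src) + 1):
                for eng, p_fe in PHRASE_TABLE.get(tuple(src[i:j]), []):
                    words = eng.split()
                    # Multiply (add logs of) the phrase score and the LM score
                    new = score + math.log(p_fe) + lm_logprob(hist[-1], words)
                    beams[j].append((new, hist + tuple(words)))
    finals = [(s + lm_logprob(h[-1], ["END"]), h[1:]) for s, h in beams[len(src)]]
    return max(finals) if finals else None

print(decode("la maison bleue".split()))
```

Word-by-word translation would give "the house blue"; the two-word phrase entry together with the language model score lets the fluent ordering win instead.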


The Pharaoh Decoder

Probabilities at each step include both the language model (LM) and the translation model (TM).


Recent Progress in Statistical MT

2002:

insistent Wednesday may recurred her trips to Libya tomorrow for flying

Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .

And said the official " the institution sent a speech to Ministry of Foreign Affairs of lifting on Libya air , a situation her receiving replying are so a trip will pull to Libya a morning Wednesday " .

2003:

Egyptair Has Tomorrow to Resume Its Flights to Libya

Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

" The official said that the company had sent a letter to the Ministry of Foreign Affairs, information on the lifting of the air embargo on Libya, where it had received a response, the first take off a trip to Libya on Wednesday morning ".

(slide from C. Wayne, DARPA)


Statistical Machine Translation

[Figure: parallel text (… la maison … la maison bleue … la fleur … / … the house … the blue house … the flower …) feeds a table of translation probabilities:]

P(juste | fair) = 0.411
P(juste | correct) = 0.027
P(juste | right) = 0.020
…

[A new French sentence is turned into possible English translations, to be rescored by the language model.]


What is MT not (yet) good for?

• Really hard stuff
  • Literature
  • Natural spoken speech (meetings, court reporting)
• Really important stuff
  • Medical translation in hospitals, 911


What is MT good for?

• Tasks for which a rough translation is fine
  • Web pages, email
  • Multilingual speech-based queries
• Tasks for which MT can be post-edited
  • MT as first pass
  • “Computer-aided human translation”
• Tasks in sublanguage domains where high-quality MT is possible


The next five years

• Bootstrapping resources
  • Trying to design better learning methods to work from scarce data (see Knight 2003, Plauche et al 2007)
  • Germann and the ISI experiment in Tamil
    • MT in a month
    • 100K tokens achieved tolerable performance in 2002
• Including syntactic/semantic information in SMT
  • Markup on the Web
  • Multi-lingual lexical resources
    • WordNet
    • PropBank
    • FrameNet
• Combining MT methods


Pos | Language | Family | Script(s) Used | Speakers (millions) | Where Spoken (Major)
1 | Mandarin | Sino-Tibetan | Chinese Characters | 1051 | China, Malaysia, Taiwan
2 | English | Indo-European | Latin | 510 | USA, UK, Australia, Canada, New Zealand
3 | Hindi | Indo-European | Devanagari | 490 | North and Central India
4 | Spanish | Indo-European | Latin | 425 | The Americas, Spain
5 | Arabic | Afro-Asiatic | Arabic | 255 | Middle East, Arabia, North Africa
6 | Russian | Indo-European | Cyrillic | 254 | Russia, Central Asia
7 | Portuguese | Indo-European | Latin | 218 | Brazil, Portugal, Southern Africa
8 | Bengali | Indo-European | Bengali | 215 | Bangladesh, Eastern India
9 | Indonesian | Malayo-Polynesian | Latin | 175 | Indonesia, Malaysia, Singapore
10 | French | Indo-European | Latin | 130 | France, Canada, West Africa, Central Africa
11 | Japanese | Altaic | Chinese Characters and 2 Japanese Alphabets | 127 | Japan
12 | German | Indo-European | Latin | 123 | Germany, Austria, Central Europe
13 | Farsi (Persian) | Indo-European | Nastaliq | 110 | Iran, Afghanistan, Central Asia
14 | Urdu | Indo-European | Nastaliq | 104 | Pakistan, India
15 | Punjabi | Indo-European | Gurumukhi | 103 | Pakistan, India
16 | Vietnamese | Austroasiatic | Based on Latin | 86 | Vietnam, China
17 | Tamil | Dravidian | Tamil | 78 | Southern India, Sri Lanka, Malaysia
18 | Wu | Sino-Tibetan | Chinese Characters | 77 | China
19 | Javanese | Malayo-Polynesian | Javanese | 76 | Indonesia
20 | Turkish | Altaic | Latin | 75 | Turkey, Central Asia
21 | Telugu | Dravidian | Telugu | 74 | Southern India
22 | Korean | Altaic | Hangul | 72 | Korean Peninsula
23 | Marathi | Indo-European | Devanagari | 71 | Western India
24 | Italian | Indo-European | Latin | 61 | Italy, Central Europe
25 | Thai | Sino-Tibetan | Thai | 60 | Thailand, Laos
26 | Cantonese | Sino-Tibetan | Chinese Characters | 55 | Southern China
27 | Gujarati | Indo-European | Gujarati | 47 | Western India, Kenya
28 | Polish | Indo-European | Latin | 46 | Poland, Central Europe
29 | Kannada | Dravidian | Kannada | 44 | Southern India
30 | Burmese | Sino-Tibetan | Burmese | 42 | Myanmar


Top Ten Internet Languages


MT in Developing Countries

Traditional Rec

Community Rec


Related Berkeley work at TIER

• Kiosks / Livelihood
  • Cellphones for pricing in rural Rwandan coffee markets
  • Computers and livelihood development in urban slums in Brazil
  • E-literacy / Entrepreneurship in rural Kerala
• Education
  • Studies of social impacts of Computer Aided Learning in rural areas
  • Observations of shared computer usage among children in resource-strapped areas
• Telemedicine
  • Long-distance diagnosis using 802.11b
• Teaching
  • ‘Technology and Development’ graduate class design (see reader/syllabus)
• Conference
  • First peer-reviewed IEEE/ACM conference in series


URL bibliography

http://www.cicc.or.jp - CICC website
http://nespole.itc.it - NESPOLE! website
http://www.umiacs.umd.edu - UMIACS website
http://www.isi.edu
http://www-2.cs.cmu.edu
http://www.lti.cs.cmu.edu
http://blombos.isi.edu - DINO browser
http://www-2.cs.cmu.edu - Enthusiast
http://www.ll.mit.edu - CCLINC
http://www-2.cs.cmu.edu - Speechalator
http://isl.ira.uka.de - FAME
http://www.cogsci.princeton.edu - WordNet
http://www.globalwordnet.org - Global WordNet Association
http://www.illc.uva.nl - EuroWordNet
http://www.sfs.nphil.uni-tuebingen.de - GermaNet
http://www.ceid.upatras.gr - BalkaNet
http://www.keenage.com - Chinese HowNet
http://www.gittens.nl - Mimida multilingual semantic network
http://www.icsi.berkeley.edu - FrameNet project
http://www.coli.uni-sb.de - SALSA project
http://www.nak.ics.keio.ac.jp - FrameNet project for Japanese
http://gemini.uab.es - FrameNet project for Spanish
http://www.cis.upenn.edu - PropBank project
http://www.cis.upenn.edu - VerbNet
http://www.cis.upenn.edu - combination of VerbNet and FrameNet
http://nlp.cs.nyu.edu - The NomBank


References