Introduction to the Language Technologies Institute

Introduction to the Language Technologies Institute Fall, 2008 Jaime Carbonell [email protected]


Page 1: Introduction to the  Language Technologies Institute

Introduction to the Language Technologies Institute

Fall, 2008

Jaime Carbonell

[email protected]

Page 2: Introduction to the  Language Technologies Institute

School of Computer Science at Carnegie Mellon University

Computer Science Department (theory, systems)

Robotics Institute (space, industry, medical)

Language Technologies Institute (MT, speech, IR)

Human-Computer Interaction Inst. (Ergonomics)

Institute for Software Research Int. (SE)

Machine Learning Department (ML theory)

Entertainment Technologies (Animation, graphics)

Page 3: Introduction to the  Language Technologies Institute

Language Technologies Institute

Founded in 1986 as the Center for Machine Translation (CMT).
Became Language Technologies Institute in 1996, unifying CMT, Comp Ling program.

Current Size: 197 FTEs
  27 Faculty (including joint appointments)
  25 Staff
  125 Graduate Students (90 PhD, 40 MLT)
  10 Visiting Scholars

Page 4: Introduction to the  Language Technologies Institute

LTI Bill of Rights

Get the right information
To the right people
At the right time
On the right medium
In the right language
At the right level of detail

Page 5: Introduction to the  Language Technologies Institute

Slogan                  Challenges
…right information      IR, filtering, TC, …
…right people           routing, personalization, …
…right time             anticipatory analysis, …
…right medium           text, speech, video, …
…right language         translation, bio, …
…right detail           summarization, expansion

Page 6: Introduction to the  Language Technologies Institute

“…on the Right Medium”

Speech Recognition
  SPHINX (Reddy, Rudnicky, Rosenfeld, …)
  JANUS (Waibel, Schultz, …)
Speech Synthesis
  Festival (Black, Lenzo)
Handwriting & Gesture Recognition
  ISL (Waibel, J. Yang)
Multimedia Integration (CSD)
  Informedia (Wactlar, Hauptmann, …)

Page 7: Introduction to the  Language Technologies Institute

“… in the Right Language”

High-Accuracy Interlingual MT
  KANT (Nyberg, Mitamura)
Parallel Corpus-Trainable MT
  Statistical MT (Lafferty, Vogel)
  Example-Based MT (Brown, Carbonell)
AVENUE Instructible MT (Levin, Lavie, Carbonell)
Multi-Engine MT (Lavie, Frederking)
Speech-to-speech MT
  JANUS/DIPLOMAT/AVENUE (Waibel, Frederking, Levin, Schultz, Vogel, Lafferty, Black, …)

Page 8: Introduction to the  Language Technologies Institute

We also Engage in:

Tutoring Systems (Eskenazi, Callan)
Linguistic Analysis (Levin, Mitamura, …)
Dialog Systems (Rudnicky, Waibel, …)
Computational Biology
  Protein structure/function (Carbonell, Langmead)
  DNA seq/motifs (Yang, Xing, Rosenfeld)
Complex System Design (Nyberg, Callan)
Machine Learning (Carbonell, Lafferty, Yang, Rosenfeld, Xing, Cohen, …)
Question Answering (Nyberg, Mitamura, …)

Page 9: Introduction to the  Language Technologies Institute

How we do it at LTI

Data-driven methods
  Statistical learning
  Corpora-based
  Examples:
    Statistical MT
    Example-based MT
    Text categorization
    Novelty detection
    Translingual IR

Knowledge-based
  Symbolic learning
  Linguistic analysis
  Knowledge representation
  Examples:
    Interlingual MT
    Parsing & generation
    Discourse modeling
    Language tutoring

Page 10: Introduction to the  Language Technologies Institute

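MMR Ranking vs Standard IR

[Figure: a query and a set of documents, with the MMR ranking path contrasted against the standard IR ranking path. Caption: λ controls spiral curl]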
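As a rough illustration of how λ trades relevance against redundancy, here is a minimal sketch of MMR re-ranking. It assumes the common formulation, in which each step picks the not-yet-selected document d that maximizes λ·sim(d, query) − (1 − λ)·max sim(d, already selected). The bag-of-words cosine similarity, the names `cosine` and `mmr_rank`, and the sample documents are assumptions for illustration, not taken from the slides.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def mmr_rank(query, docs, lam=0.5):
    """Return document indices in MMR order (hypothetical helper)."""
    q = Counter(query.lower().split())
    vecs = [Counter(d.lower().split()) for d in docs]
    remaining = list(range(len(docs)))
    selected = []
    while remaining:
        def score(i):
            relevance = cosine(vecs[i], q)
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Illustrative data (assumed): documents 0 and 1 are near-duplicates.
docs = [
    "machine translation of arabic text",
    "machine translation of arabic text today",
    "speech translation systems",
]
query = "machine translation"

print(mmr_rank(query, docs, lam=1.0))  # pure relevance ordering
print(mmr_rank(query, docs, lam=0.5))  # relevance balanced against novelty
```

With λ = 1 the ordering reduces to plain relevance ranking; lowering λ pushes the near-duplicate document further down the list.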

Page 11: Introduction to the  Language Technologies Institute

Adaptive Filtering over a Document Stream

[Figure: a document stream along a time axis. Labels: Training documents (past), Current document: On-topic?, Test documents, On-topic documents, Off-topic documents, Unlabeled documents, RF, Topic 1, Topic 2, Topic 3…]
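The on-topic decision for each incoming document can be sketched with a toy filter. This is only one possible design, assumed here for illustration: a per-topic term profile, a cosine score against a fixed threshold, and a Rocchio-style update when feedback arrives. The class name `AdaptiveFilter`, its parameters, and the sample documents are hypothetical and do not come from the slides.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class AdaptiveFilter:
    """Toy single-topic filter whose profile adapts to feedback (assumed design)."""

    def __init__(self, seed_docs, threshold=0.2, alpha=1.0, beta=0.5):
        self.profile = Counter()
        self.threshold = threshold
        self.alpha = alpha  # weight for on-topic feedback
        self.beta = beta    # weight for off-topic feedback
        for doc in seed_docs:
            self.feedback(doc, on_topic=True)

    def score(self, doc):
        return cosine(self.profile, Counter(doc.lower().split()))

    def is_on_topic(self, doc):
        return self.score(doc) >= self.threshold

    def feedback(self, doc, on_topic):
        # Move the profile toward on-topic documents and away from off-topic ones.
        weight = self.alpha if on_topic else -self.beta
        for term, count in Counter(doc.lower().split()).items():
            self.profile[term] += weight * count


topic = AdaptiveFilter(["earthquake hits coastal city"])
for doc in ["strong earthquake shakes city", "stock market rallies"]:
    print(doc, "->", topic.is_on_topic(doc))
```

A real system would also adapt the threshold over time; the fixed value here just keeps the sketch short.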

Page 12: Introduction to the  Language Technologies Institute
Page 13: Introduction to the  Language Technologies Institute

Types of Machine Translation

[Figure: translation pyramid from Source (Arabic) to Target (English). Labels: Syntactic Parsing, Semantic Analysis, Interlingua, Sentence Planning, Text Generation, Transfer Rules, Direct: SMT, EBMT]

Page 14: Introduction to the  Language Technologies Institute

EBMT Example

English: I would like to meet her.
Mapudungun: Ayükefun trawüael fey engu.

English: The tallest man is my father.
Mapudungun: Chi doy fütra chi wentru fey ta inche ñi chaw.

English: I would like to meet the tallest man
Mapudungun (new): Ayükefun trawüael Chi doy fütra chi wentru
Mapudungun (correct): Ayüken ñi trawüael chi doy fütra wentruengu.
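The "new" output above is what naive fragment recombination produces. A minimal sketch, assuming a greedy longest-match lookup over stored fragment pairs followed by plain concatenation: the two fragment alignments are inferred from the slide's own output, and the names `FRAGMENTS` and `translate_by_fragments` are hypothetical.

```python
# Fragment pairs implied by the slide's "new" translation
# (the alignment is an assumption made for illustration).
FRAGMENTS = {
    "i would like to meet": "Ayükefun trawüael",
    "the tallest man": "Chi doy fütra chi wentru",
}


def translate_by_fragments(sentence, fragments=FRAGMENTS):
    """Greedy longest-match fragment lookup, then plain concatenation."""
    words = sentence.lower().rstrip(".").split()
    out, i = [], 0
    while i < len(words):
        # Try the longest span starting at position i first.
        for j in range(len(words), i, -1):
            chunk = " ".join(words[i:j])
            if chunk in fragments:
                out.append(fragments[chunk])
                i = j
                break
        else:
            # No stored example covers this word; pass it through unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)


print(translate_by_fragments("I would like to meet the tallest man"))
```

The sketch reproduces the slide's uncorrected output, which shows why simple concatenation is not enough: the correct translation changes the verb form and the fragment boundary.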

Page 15: Introduction to the  Language Technologies Institute

Ambiguity Makes MT Hard

Word Senses for “line” (52 senses in Random House English-Japanese Dictionary)

Power line – densen (電線)
Subway line – chikatetsu (地下鉄)
(Be) on line – onrain (オンライン)
(Be) on the line – denwachuu (電話中)
Line up – narabu (並ぶ)
Line one’s pockets – kanemochi ni naru (金持ちになる)
Line one’s jacket – uwagi o nijuu ni suru (上着を二重にする)
Actor’s line – serifu (セリフ)
Get a line on – joho o eru (情報を得る)

Page 16: Introduction to the  Language Technologies Institute

CONTEXT: More is Better

“The line for the new play extended for 3 blocks.”
“The line for the new play was changed by the scriptwriter.”
“The line for the new play got tangled with the other props.”
“The line for the new play better protected the quarterback.”
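To make the point concrete, here is a toy sketch of picking a sense for “line” from surrounding words. The sense labels and cue-word sets are assumptions chosen to fit the four sentences above; a real system would learn such cues from data rather than list them by hand.

```python
# Hypothetical cue words per sense of "line" (illustrative only).
SENSE_CUES = {
    "queue": {"extended", "blocks"},
    "script line": {"changed", "scriptwriter"},
    "rope": {"tangled", "props"},
    "offensive line": {"protected", "quarterback"},
}


def guess_sense(sentence, cues=SENSE_CUES):
    """Return the sense whose cue words overlap the sentence most, or None."""
    words = {w.strip('.,“”"').lower() for w in sentence.split()}
    scores = {sense: len(words & cue) for sense, cue in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


for s in [
    "The line for the new play extended for 3 blocks.",
    "The line for the new play was changed by the scriptwriter.",
    "The line for the new play got tangled with the other props.",
    "The line for the new play better protected the quarterback.",
    "The line for the new play",
]:
    print(s, "->", guess_sense(s))
```

The shared prefix “The line for the new play” gives no cue at all, so the last call returns None; only the added context separates the four senses.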

Page 17: Introduction to the  Language Technologies Institute

Primary Sequence

MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA

3D Structure

Folding

Complex function within network of proteins

Normal

PROTEINS: Sequence, Structure, Function

(Borrowed from: Judith Klein-Seetharaman)

Page 18: Introduction to the  Language Technologies Institute

Primary Sequence

MNGTEGPNFY VPFSNKTGVV RSPFEAPQYY LAEPWQFSML AAYMFLLIML GFPINFLTLY VTVQHKKLRT PLNYILLNLA VADLFMVFGG FTTTLYTSLH GYFVFGPTGC NLEGFFATLG GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLVGWSRYIP EGMQCSCGID YYTPHEETNN ESFVIYMFVV HFIIPLIVIF FCYGQLVFTV KEAAAQQQES ATTQKAEKEV TRMVIIMVIA FLICWLPYAG VAFYIFTHQG SDFGPIFMTI PAFFAKTSAV YNPVIYIMMN KQFRNCMVTT LCCGKNPLGD DEASTTVSKT ETSQVAPA

3D Structure

Folding

Complex function within network of proteins

Disease

PROTEINS: Sequence, Structure, Function

Page 19: Introduction to the  Language Technologies Institute

Predicting Protein Structures

Protein structure is a key determinant of protein function.
Crystallography to resolve protein structures experimentally in vitro is very expensive; NMR can only resolve very small proteins.
The gap between the known protein sequences and structures: 3,023,461 sequences vs. 36,247 resolved structures (1.2%).
Therefore we need to predict structures in silico.

Page 20: Introduction to the  Language Technologies Institute

Linked Segmentation CRF

Node: secondary structure elements and/or simple fold
Edges: local interactions and long-range inter-chain and intra-chain interactions
L-SCRF: conditional probability of y given x is defined as

$$
P(\mathbf{y}_1,\dots,\mathbf{y}_R \mid \mathbf{x}_1,\dots,\mathbf{x}_R)
= \frac{1}{Z}
\prod_{y_{i,j} \in V_G} \exp\Big(\sum_k \theta_{1,k}\, f_k(\mathbf{x}_i, y_{i,j})\Big)
\prod_{\langle y_{i,j},\, y_{a,b} \rangle \in E_G} \exp\Big(\sum_l \theta_{2,l}\, g_l(\mathbf{x}_i, \mathbf{x}_a, y_{i,j}, y_{a,b})\Big)
$$

Joint Labels

Page 21: Introduction to the  Language Technologies Institute

Discriminative Semi-Markov Model for Parallel Right-handed β-Helix Prediction

Structures
  A regular super-secondary structure with an elongated helix whose successive rungs are composed of beta-strands
  Conserved T2 turn
Computational importance
  Long-range interactions
Biological importance
  Functions such as the bacterial infection of plants, binding the O-antigen, antifreeze, ...

Page 22: Introduction to the  Language Technologies Institute

Some LTI Accomplishments

First large-scale web-spider (LYCOS)
First speech-speech MT (JANUS)
First high-accuracy text MT (KANT)
First minority-language MT (DIPLOMAT)
First high-accuracy translingual IR
First multidocument summarizer (MMR)