Computational Linguistics at OSU Chris Brew Linguistics, Cognitive Science and CSE The Ohio State...

Computational Linguistics at OSU

Chris Brew
Linguistics, Cognitive Science and CSE
The Ohio State University

Who am I?

Chris Brew, Associate Professor
Full-time in NLP since about 1984.
B.Sc. Chemistry (Bristol)
Masters and Ph.D. (Sussex)
NLP done in a Psychology department!
Research positions at Sussex, Edinburgh and in industry (Sharp)
Faculty in Linguistics at Ohio State since 2000
Joint appointment in CSE

What I’ve done

Parsing and Dialogue
Machine Translation (teaching class now)
XML and corpus annotation
Learning word meanings from large datasets
Sound/Meaning relations
Other stuff…

Linguistics

Linguistics is the scientific study of language and communication.

Linguists run experiments, do surveys, build simulations, do proofs.

Linguistics at OSU is:
In the top 10 nationally
Diverse and open-minded

Strengths of Linguistics at OSU

Syntax, Semantics, Pragmatics
Phonetics: the study of how people make and perceive the sounds of language
Psycholinguistics: the study of how people process sounds, words, sentences, intonation
Sociolinguistics: the study of how society and social situations change the way we speak.
Computational Linguistics and NLP

Computational Linguistics at OSU

3 faculty members and 20 students based in Linguistics (Oxley Hall)
Detmar Meurers (Parsing, Corpus Annotation, Computer-aided Language Learning)
Chris Brew (Statistical NLP)
Michael White (Natural Language Generation)

Close ties with Drs. Byron and Fosler-Lussier in CSE.

We are willing and able to advise or co-advise on research, and have projects that cross the departmental boundaries.

Computational Linguistics

Data Intensive Linguistics: using large datasets to answer questions about language
How do children learn language?
How do technical terms get their meanings?
Why do people have so little difficulty understanding what others are saying?
How are words stored in the brain?


Machine understanding: building machines that read, write, converse using natural language.

Several well-known subtasks
Tokenization:
Parsing: building syntax trees
Building meaning representations (MR)
Generating language from MR


NLP: building systems that do useful or interesting things with language
Summarization
Machine Translation
Question Answering
Document Understanding

Relation to CSE

Challenging problems in working with large datasets.
Document classification is large along three dimensions
Large number of available predictive features (10^4 different words in typical collections)
Many instances (1000s or millions of sentences)
Many possible outputs (e.g. classify against the 100s of labels in the DMOZ hierarchy)
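The three dimensions of size named above can be counted directly on a toy collection. The following is a minimal sketch, assuming a plain bag-of-words representation with whitespace tokenization; the function names `bag_of_words` and `problem_size` and the two example documents are hypothetical and not from the talk.

```python
from collections import Counter

def bag_of_words(doc):
    """Sparse feature vector: word -> count (assumes whitespace tokenization)."""
    return Counter(doc.lower().split())

def problem_size(docs, labels):
    """Return (number of features, number of instances, number of outputs)."""
    vectors = [bag_of_words(d) for d in docs]
    vocabulary = set()
    for v in vectors:
        vocabulary.update(v)          # every distinct word is one feature
    return len(vocabulary), len(docs), len(set(labels))

if __name__ == "__main__":
    docs = ["the parser builds a tree", "the tagger labels a word"]
    labels = ["parsing", "tagging"]   # hypothetical category labels
    print(problem_size(docs, labels))
```

Each document touches only a few of the features, so the vectors stay sparse even when the vocabulary grows to tens of thousands of words.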

Relation to CSE

Consumer of CS tools
Tokenization, Parsing
Could use lex and yacc (javacc/antlr), but beware ambiguity
Many special purpose parsers, taggers, chunkers that use machine learning to achieve robustness
Machine understanding
AI-complete
Prolog and other PL innovations caused by NL research
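To make the ambiguity warning concrete, here is a minimal sketch of a lexer-style tokenizer written as a single regular expression in Python. The pattern and the name `tokenize` are illustrative assumptions, not something taken from the talk or from any particular toolkit.

```python
import re

# Assumed toy pattern: a word (optionally with an apostrophe part such as 's or n't),
# or any single character that is neither a word character nor whitespace.
TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens using one fixed rule."""
    return TOKEN.findall(text)

if __name__ == "__main__":
    print(tokenize("Dr. Brew's class isn't full."))
```

The rule treats the period after "Dr" exactly like the sentence-final period. A fixed lexical rule cannot tell an abbreviation from the end of a sentence without more context, which is one reason robust taggers and chunkers tend to be learned from data.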

Why the world cares

1700 biology papers per day. Nobody can keep up. UNDERSTAND/SUMMARIZE

Ad placement in search engines. Perhaps you can spot a search for flights to Paris and place a successful sidebar ad for expensive and elegant evening wear. INTENT

Automated essay grading. CLASSIFICATION
Too many emails to monitor. Spooks can’t keep up. Especially in Arabic.

There is demand…

Develop language-independent algorithms, techniques, and methodologies to support rapid development of the basic resources … for any arbitrary language with a written form. Corpus-based unsupervised and lightly-supervised methods are acceptable, as are lightweight elicitation methodologies from untrained native speakers or other generally available (in the US) informants.

Research on English and Foreign Language EXploitation (REFLEX)
Broad Agency Announcement (BAA)
BAA 04-01-FH
15 March 2004

Current work

NSF Career project
Key idea: dimensionality reduction for linguistic data.
Hypothesis: neighborhood structure is more important and cognitively salient than (for example) preserving detail of long-distance relationships

Compare: min-cut, LLE, SNE, LSI
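One way to compare such methods on the neighborhood hypothesis is to ask how many of each point's nearest neighbors survive the reduction. The sketch below is an assumption about how that could be measured, not the project's actual evaluation; `knn` and `neighborhood_preservation` are hypothetical names, and Euclidean distance is assumed in both spaces.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean distance)."""
    dists = sorted((math.dist(points[i], p), j)
                   for j, p in enumerate(points) if j != i)
    return {j for _, j in dists[:k]}

def neighborhood_preservation(high, low, k):
    """Average fraction of each point's k nearest neighbors in the original
    space (high) that are still among its k nearest in the reduced space (low)."""
    n = len(high)
    kept = sum(len(knn(high, i, k) & knn(low, i, k)) / k for i in range(n))
    return kept / n

if __name__ == "__main__":
    high = [(0, 0, 0), (1, 0, 0), (0, 5, 0), (1, 5, 0)]
    keep_xy = [(x, y) for x, y, _ in high]   # drops an uninformative axis
    keep_x = [(x,) for x, _, _ in high]      # drops the axis that separates the pairs
    print(neighborhood_preservation(high, keep_xy, 1))
    print(neighborhood_preservation(high, keep_x, 1))
```

A score near 1 means local structure is intact even if long-distance relationships have been distorted.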

Paul Davis

Statistical Machine Translation
Is there a simple and flexible architecture for Statistical MT?
Why: current systems are all built on an IBM design.
they all mess up
they all mess up in much the same way
Alternatives are needed.

Graduated 2002: now at Motorola Research

Martin Jansche

Learning String-to-String Transductions (mostly for text-to-speech)

Bucks -> /b u k z/
Why: People were doing lots of this, but the theory, the evaluation criteria and the quality of the resulting systems left much to be desired.

Graduated 2003: now at Columbia Center for Machine Learning as research faculty

Nathan Vaillette

Formally verified string-to-string transductions.
Rule: aa -> b
Input: aaacaa. What is the output? bbcb? bacb? abcb?
Why: rules like these are used a lot, but there is no convincing account of exactly what they mean.
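Each of the three candidate outputs falls out of a different, equally plausible reading of the rule. The sketch below shows three assumed application strategies in Python; the function names are hypothetical, and this is an illustration of the ambiguity only, not the formal model developed in the thesis.

```python
def rewrite_left_to_right(s, lhs, rhs):
    """Scan from the left; replace each match and continue after it."""
    out, i = [], 0
    while i < len(s):
        if s.startswith(lhs, i):
            out.append(rhs)
            i += len(lhs)
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

def rewrite_right_to_left(s, lhs, rhs):
    """Same scan, but starting from the right end (done by mirroring)."""
    return rewrite_left_to_right(s[::-1], lhs[::-1], rhs[::-1])[::-1]

def rewrite_simultaneous(s, lhs, rhs):
    """Every match start emits rhs, even when matches overlap;
    symbols not covered by any match are copied unchanged."""
    starts = {i for i in range(len(s) - len(lhs) + 1) if s.startswith(lhs, i)}
    covered = {j for i in starts for j in range(i, i + len(lhs))}
    out = []
    for i, ch in enumerate(s):
        if i in starts:
            out.append(rhs)
        elif i not in covered:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    for f in (rewrite_simultaneous, rewrite_left_to_right, rewrite_right_to_left):
        print(f.__name__, f("aaacaa", "aa", "b"))
```

On the input aaacaa the simultaneous reading gives bbcb, the left-to-right reading gives bacb, and the right-to-left reading gives abcb, so the rule notation alone does not determine the output.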

… Nathan Vaillette

Used technology from hardware verification (!) to build and implement a formal model of the string rewriting process.

First ever implementation of this widely used component for which the specification is clear and the correspondence between specification and implementation provably correct.

Graduated 2003
Now teaching AI at Hampshire College

Sabine Schulte im Walde

Inducing German Verb Classes from Corpus Data.

Why: build better dictionaries automatically
Why: difficult large dataset
Technology: k-means, spectral clustering
Graduated: 2003 from University of Stuttgart
Language Technology Manager with Duden dictionaries, then research staff, University of Saarbrücken

Kyuchul Yoon

Grapheme to Phoneme conversion for Korean

Why: words of foreign origin need special treatment, existing machine learning approaches are too knowledge-free

Graduated 2005
Now at Pusan University

Anna Feldman

Using Czech language resources to bootstrap resources for Russian
Why: Czech and Russian are supposed to be related, but can we use this fact technologically?
Yes. Works, but not perfectly.

Same thing, for Spanish and Portuguese

Anton Rytting

Computational and experimental studies of spoken language, emphasis on word segmentation strategies that might be useful to infants

Why: infants should be able to learn any language.

Medical Informatics (very new)

Collaboration with John Pestian, Cincinnati Children's Hospital Medical Center
Why: doctors provide discharge summaries (i.e. text), we want information (mundanely: ICD-9 terms as billing codes)
How: neural networks, careful encoding of domain knowledge. Tuning of ICD-9 to include/exclude terms that do/don't occur in radiology summaries

What I’d like to do more of

Very large scale work
Unsupervised and lightly supervised learning
Cute applications of machine learning
Distributed and parallel NLP

What I am looking for?

People who can take an idea about learning from data and turn it into a Master’s thesis. Especially people who have side expertise in an application area, such as medicine, biology, business, lion-taming.

Might have funding for the right person, though Linguistics Ph.D students take precedence.

People with very good communication and programming skills who could collaborate with a Linguistics student to make something better than either could alone. Cognitive Science summer fellowships.

Interesting new problems that can be learned from data.
