Computational Linguistics at OSU
Chris Brew
Linguistics, Cognitive Science and CSE
The Ohio State University
Who am I?
Chris Brew, Associate Professor
Full-time in NLP since about 1984.
B.Sc Chemistry (Bristol)
Masters and Ph.D (Sussex)
NLP done in a Psychology department!
Research positions at Sussex, Edinburgh and in industry (Sharp)
Faculty in Linguistics at Ohio State since 2000
Joint appointment in CSE
What I’ve done
Parsing and Dialogue
Machine Translation (teaching class now)
XML and corpus annotation
Learning word meanings from large datasets
Sound/Meaning relations
Other stuff…
Linguistics
Linguistics is the scientific study of language and communication.
Linguists run experiments, do surveys, build simulations, do proofs.
Linguistics at OSU is:
In the top 10 nationally
Diverse and open-minded
Strengths of Linguistics at OSU
Syntax, Semantics, Pragmatics
Phonetics: the study of how people make and perceive the sounds of language
Psycholinguistics: the study of how people process sounds, words, sentences, intonation
Sociolinguistics: the study of how society and social situations change the way we speak.
Computational Linguistics and NLP
Computational Linguistics at OSU
3 faculty members and 20 students based in Linguistics (Oxley Hall)
Detmar Meurers (Parsing, Corpus Annotation, Computer-aided Language Learning)
Chris Brew (Statistical NLP)
Michael White (Natural Language Generation)
Close ties with Drs. Byron and Fosler-Lussier in CSE.
We are willing and able to advise or co-advise on research, and have projects that cross the departmental boundaries.
Computational Linguistics
Data Intensive Linguistics: using large datasets to answer questions about language
How do children learn language?
How do technical terms get their meanings?
Why do people have so little difficulty understanding what others are saying?
How are words stored in the brain?
Computational Linguistics
Machine understanding: building machines that read, write, converse using natural language.
Several well-known subtasks
Tokenization
Parsing: building syntax trees
Building meaning representations (MR)
Generating language from MR
Computational Linguistics
NLP: building systems that do useful or interesting things with language
Summarization
Machine Translation
Question Answering
Document Understanding
Relation to CSE
Challenging problems in working with large datasets.
Document classification is large along three dimensions
Large number of available predictive features (10^4 different words in typical collections)
Many instances (1000s or millions of sentences)
Many possible outputs (e.g. classify against the 100s of labels in the DMOZ hierarchy)
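The three dimensions can be made concrete with a toy bag-of-words count. This is only a minimal sketch: the three documents, the labels "Travel" and "Science", and the function names are all made up for illustration, and a real collection would be many orders of magnitude larger on every axis.

```python
from collections import Counter


def bag_of_words(text):
    """Lower-case, whitespace-split word counts (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())


def dataset_dimensions(labelled_docs):
    """Report the size of a labelled collection along the three dimensions."""
    vocabulary = set()
    labels = set()
    for text, label in labelled_docs:
        vocabulary.update(bag_of_words(text))  # every distinct word is a feature
        labels.add(label)
    return {
        "features": len(vocabulary),
        "instances": len(labelled_docs),
        "outputs": len(labels),
    }


# Hypothetical toy collection, not real DMOZ data.
toy_docs = [
    ("cheap flights to Paris", "Travel"),
    ("flights to Rome", "Travel"),
    ("protein folding in yeast", "Science"),
]

dims = dataset_dimensions(toy_docs)
print(dims)
```

Even here the feature count grows faster than the instance count, which is the usual situation with text.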
Relation to CSE
Consumer of CS tools
Tokenization, Parsing
Could use lex and yacc (javacc/antlr), but beware ambiguity
Many special purpose parsers, taggers, chunkers that use machine learning to achieve robustness
Machine understanding
AI-complete
Prolog and other PL innovations caused by NL research
Why the world cares
1700 biology papers per day. Nobody can keep up. UNDERSTAND/SUMMARIZE
Ad placement in search engines. Perhaps you can spot a search for flights to Paris and place a successful sidebar ad for expensive and elegant evening wear. INTENT
Automated essay grading. CLASSIFICATION
Too many emails to monitor. Spooks can’t keep up. Especially in Arabic
There is demand…
Develop language-independent algorithms, techniques, and methodologies to support rapid development of the basic resources … for any arbitrary language with a written form. Corpus-based unsupervised and lightly-supervised methods are acceptable, as are lightweight elicitation methodologies from untrained native speakers or other generally available (in the US) informants.
Research on English and Foreign Language EXploitation (REFLEX)
Broad Agency Announcement (BAA)
BAA 04-01-FH15 March 2004
Current work
NSF Career project
Key idea: dimensionality reduction for linguistic data.
Hypothesis: neighborhood structure is more important and cognitively salient than (for example) preserving detail of long-distance relationships
Compare: min-cut, LLE, SNE, LSI
Paul Davis
Statistical Machine Translation
Is there a simple and flexible architecture for Statistical MT?
Why: current systems are all built on an IBM design.
they all mess up
they all mess up in much the same way
Alternatives are needed.
Graduated 2002: now at Motorola Research
Martin Jansche
Learning String-to-String Transductions (mostly for text-to-speech)
Bucks -> /b u k z/
Why: People were doing lots of this, but the theory, the evaluation criteria and the quality of the resulting systems left much to be desired.
Graduated 2003: now at Columbia Center for Machine Learning as research faculty
Nathan Vaillette
Formally verified string-to-string transductions.
Rule: aa -> b
Input aaacaa. What is the output?
bbcb ? bacb ? abcb ?
Why: rules like these are used a lot, but there is no convincing account of exactly what they mean.
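One way to see why the question has three defensible answers is to spell out three application strategies and run each on the same input. This is only an illustrative sketch: the strategies and function names are assumptions chosen to reproduce the three candidate outputs, not the formal model developed in the thesis.

```python
def rewrite_left_to_right(s, lhs, rhs):
    """Scan from the left, replacing each non-overlapping match as it is found."""
    out = []
    i = 0
    while i < len(s):
        if s.startswith(lhs, i):
            out.append(rhs)
            i += len(lhs)  # skip past the consumed match
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def rewrite_right_to_left(s, lhs, rhs):
    """Scan from the right, by reversing everything and reusing the left scan."""
    return rewrite_left_to_right(s[::-1], lhs[::-1], rhs[::-1])[::-1]


def rewrite_every_match(s, lhs, rhs):
    """Emit rhs at every match position, overlapping or not.

    Symbols that no match covers are copied through unchanged.
    """
    starts = {i for i in range(len(s) - len(lhs) + 1) if s.startswith(lhs, i)}
    covered = {j for i in starts for j in range(i, i + len(lhs))}
    out = []
    for i, ch in enumerate(s):
        if i in starts:
            out.append(rhs)
        if i not in covered:
            out.append(ch)
    return "".join(out)


print(rewrite_every_match("aaacaa", "aa", "b"))    # bbcb
print(rewrite_left_to_right("aaacaa", "aa", "b"))  # bacb
print(rewrite_right_to_left("aaacaa", "aa", "b"))  # abcb
```

All three functions are reasonable readings of "replace aa with b", and they disagree as soon as matches can overlap, which is the gap a precise specification has to close.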
… Nathan Vaillette
Used technology from hardware verification (!) to build and implement formal model of string rewriting process.
First ever implementation of this widely used component for which the specification is clear and the correspondence between specification and implementation provably correct.
Graduated 2003. Now teaching AI at Hampshire College
Sabine Schulte im Walde
Inducing German Verb Classes from Corpus Data.
Why: build better dictionaries automatically
Why: difficult large dataset
Technology: k-means, spectral clustering
Graduated: 2003 from University of Stuttgart
Language Technology Manager with Duden dictionaries, then research staff at University of Saarbrücken
Kyuchul Yoon
Grapheme to Phoneme conversion for Korean
Why: words of foreign origin need special treatment, existing machine learning approaches are too knowledge-free
Graduated 2005. Now at Pusan University
Anna Feldman
Using Czech language resources to bootstrap resources for Russian
Why: Czech and Russian are supposed to be related, but can we use this fact technologically?
Yes. Works, but not perfectly.
Same thing, for Spanish and Portuguese
Anton Rytting
Computational and experimental studies of spoken language, emphasis on word segmentation strategies that might be useful to infants
Why: infants should be able to learn any language.
Medical Informatics (very new)
Collaboration with John Pestian, Cincinnati Children's Hospital Medical Center
Why: doctors provide discharge summaries (i.e. text), we want information (mundanely: ICD-9 terms as billing codes)
How: neural networks, careful encoding of domain knowledge. Tuning of ICD-9 to include/exclude terms that do/don't occur in radiology summaries
What I’d like to do more of
Very large scale work
Unsupervised and lightly supervised learning
Cute applications of machine learning
Distributed and parallel NLP
What am I looking for?
People who can take an idea about learning from data and turn it into a Master’s thesis. Especially people who have side expertise in an application area, such as medicine, biology, business, lion-taming.
Might have funding for the right person, though Linguistics Ph.D students take precedence.
People with very good communication and programming skills who could collaborate with a Linguistics student to make something better than either could alone.
Cognitive Science summer fellowships.
Interesting new problems that can be learned from data.