Vsevolod Dyomkin, "Natural Language Processing in Practice"

Post on 17-May-2015


Conference "AI&BigData Lab", 12 April 2014


Natural Language Processing in practice

Topics

* Overview of NLP
* Getting Data
* Models & Algorithms
* Building an NLP system
* A practical example

A bit about me

* Lisp programmer
* Architect and research lead at Grammarly (3+ years of NLP work)
* Teacher at KPI: Operating Systems
* Links:
  http://lisp-univ-etc.blogspot.com
  http://github.com/vseloved
  http://twitter.com/vseloved

A bit about Grammarly

(c) xkcd

The best English language writing enhancement app:

Spellcheck - Grammar check - Style improvement - Synonyms and word choice - Plagiarism check

What is NLP?

Transforming free-form text into structured data and back

Intersection of Comp Sci & Linguistics & Software Eng

Based on Algorithms, Machine Learning, and Statistics

Popular NLP problems

* Spam Filtering
* Spelling Correction
* Sentiment Analysis
* Question Answering
* Machine Translation
* Text Summarization
* Search (also IR)

http://www.paulgraham.com/spam.html
http://norvig.com/spell-correct.html

(c) gettyimages

Levels of NLP

* data & tools
* models
* production-ready systems

Role of Linguistics

NLP Data

* structured
* semi-structured
* unstructured

“Data is ten times more powerful than algorithms.”

-- Peter Norvig, The Unreasonable Effectiveness of Data. http://youtu.be/yvDCzhbjYWs

Kinds of data

* Dictionaries
* Corpora
* User Data

Where to get data?

* Linguistic Data Consortium http://www.ldc.upenn.edu/
* Google ngrams, book ngrams, syntactic ngrams
* Wikimedia
* WordNet
* APIs: Twitter, Wordnik, ...
* University sites: Stanford, Oxford, CMU, ...

Create your own!

* Linguists
* Crowdsourcing
* By-product

-- Jonathan Zittrain http://goo.gl/hs4qB

Tools

* analysis tools
* processing tools

* Unix command line
* XML processing
* Map-reduce systems
* R, Python, Lisp
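As a rough illustration of the map-reduce style of corpus processing listed above, here is a minimal word-count sketch in plain Python. The function names `map_phase` and `reduce_phase` and the three sample lines are made up for this example; a real map-reduce system would distribute the same two steps across machines.

```python
from collections import Counter
from functools import reduce


def map_phase(line: str) -> Counter:
    """Map step: turn one line of text into partial word counts."""
    return Counter(line.lower().split())


def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical mini-corpus, one "record" per line.
lines = ["the cat sat", "the dog sat", "the end"]
totals = reduce(reduce_phase, map(map_phase, lines), Counter())
print(totals.most_common(2))  # [('the', 3), ('sat', 2)]
```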

(c) O'Reilly Media

Algorithms

* Dynamic Programming
* Search Algorithms
* Tree Algorithms
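To make the dynamic programming item concrete, the sketch below computes Levenshtein edit distance, a common building block in spelling correction. It is one illustrative choice of algorithm, not something the slides prescribe; the function name `edit_distance` is an assumption.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance, filled in row by row with dynamic programming."""
    # previous[j] = distance between the already processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Only two rows of the table are kept in memory at a time, which is enough when just the distance (not the alignment) is needed.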

Beyond Algorithms

* CKY constituency parsing
* Noisy channel spelling correction
* TF-IDF document classification
* Bayesian filtering
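Of the techniques listed above, TF-IDF is the quickest to show in code. The sketch below uses one common variant (term frequency normalized by document length, and idf = log(N / df)); other weighting schemes exist, and the function name `tf_idf` and the sample documents are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(documents: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: the number of documents each term occurs in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical three-document corpus.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
weights = tf_idf(docs)
print(weights[0]["the"])  # 0.0, because "the" occurs in every document
```

Terms that appear everywhere get zero weight, while terms specific to one document get the highest weight, which is what makes the vectors useful as classification features.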

Models

* generative vs discriminative
* statistical vs rule-based

Language Models

Ngrams
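A minimal sketch of an ngram language model, assuming bigrams with add-one (Laplace) smoothing. The class name `BigramModel`, the sentence boundary markers, and the toy corpus are choices made for this example; production models typically use larger ngrams and better smoothing.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences: list[list[str]]):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence + ["</s>"]
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams)

    def prob(self, previous: str, word: str) -> float:
        """P(word | previous); unseen bigrams get a small non-zero probability."""
        return (self.bigrams[(previous, word)] + 1) / (self.unigrams[previous] + self.vocab_size)

    def sentence_logprob(self, sentence: list[str]) -> float:
        """Log-probability of a whole sentence, including boundary markers."""
        tokens = ["<s>"] + sentence + ["</s>"]
        return sum(math.log(self.prob(a, b)) for a, b in zip(tokens, tokens[1:]))


# Hypothetical toy corpus.
corpus = [s.split() for s in ["the cat sat", "the cat ran", "the dog sat"]]
model = BigramModel(corpus)
print(model.prob("the", "cat"))  # 0.3
print(model.sentence_logprob("the cat sat".split()) > model.sentence_logprob("sat cat the".split()))  # True
```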

Generative ML models:

* Bayesian inference (bag-of-words model)
* Hidden Markov model (sequence model)
* Neural networks (holistic model)
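The first item, Bayesian inference over a bag of words, can be sketched as a multinomial Naive Bayes classifier. Everything below (the class name `NaiveBayes`, the add-one smoothing, the four training messages) is an illustrative assumption rather than the setup used in the talk.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words features."""

    def __init__(self):
        self.class_counts = Counter()            # documents seen per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocab = set()

    def train(self, examples):
        for tokens, label in examples:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            # log P(label) + sum of log P(token | label), with add-one smoothing
            score = math.log(count / total_docs)
            denominator = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training messages.
classifier = NaiveBayes()
classifier.train([
    ("buy cheap pills now".split(), "spam"),
    ("cheap pills online".split(), "spam"),
    ("meeting moved to monday".split(), "ham"),
    ("lunch on monday".split(), "ham"),
])
print(classifier.predict("cheap pills".split()))     # spam
print(classifier.predict("monday meeting".split()))  # ham
```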

LM + Domain Model

Discriminative Models

* Heuristic
* Maximum Entropy
* “Advanced” LM Models

Going Into Prod

* Translate real-world requirements into a measurable goal
* Pre- and post-processing
* Don't trust research results
* Gather user feedback
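One common way to turn a requirement into a measurable goal is to track precision, recall, and F1 against a labeled test set. The slides do not name a specific metric, so treat this as one possible choice; the function name and the sample error positions are hypothetical.

```python
def precision_recall_f1(predicted: set, relevant: set) -> tuple[float, float, float]:
    """Compare the items a system flagged with the items it should have flagged."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: token positions flagged as errors vs. the real error positions.
flagged = {1, 4, 7, 9}
actual = {1, 4, 5}
precision, recall, f1 = precision_recall_f1(flagged, actual)
print(precision)  # 0.5
```

A goal such as "precision of at least 0.9 at the highest achievable recall" is then something that can be checked on every model change.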

Practical Example: Language Detection

Idea

Standard approach: character LM

Let's try an alternative: word LM

Data: from Wiktionary
Test data: from Wikipedia
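The word LM idea can be sketched as a unigram word model per language, with the text assigned to the language that gives it the highest smoothed log-probability. The three training sentences below are tiny hand-written stand-ins for real Wiktionary data, and all function names are assumptions made for the example.

```python
import math
import re
from collections import Counter

# Toy training text per language (assumption: stands in for word data from Wiktionary).
TRAINING_TEXT = {
    "en": "the cat is in the house and the dog is in the garden",
    "de": "die katze ist in dem haus und der hund ist in dem garten",
    "fr": "le chat est dans la maison et le chien est dans le jardin",
}


def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def train(training_text: dict[str, str]) -> dict[str, Counter]:
    """Build one unigram word-count model per language."""
    return {lang: Counter(tokenize(text)) for lang, text in training_text.items()}


def detect_language(text: str, models: dict[str, Counter]):
    """Return the language whose word model scores the text highest, or None for empty input."""
    tokens = tokenize(text)
    if not tokens:
        return None
    vocab_size = len(set().union(*models.values()))
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        # Add-one smoothing; the extra +1 reserves probability mass for unknown words.
        score = sum(math.log((counts[t] + 1) / (total + vocab_size + 1)) for t in tokens)
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


models = train(TRAINING_TEXT)
print(detect_language("the dog is in the house", models))     # en
print(detect_language("der hund ist in dem garten", models))  # de
```

Note that "in" occurs in both the English and the German sample; the surrounding words still separate the two languages.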

Practical ML System

* Training
* Evaluation
* Production
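The three stages can be sketched end to end with the standard character LM approach to language detection. This is a minimal sketch under loud assumptions: character trigrams with add-one smoothing, four hand-written training sentences, a two-sentence held-out set, and a hypothetical `detect` wrapper standing in for the production entry point.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> list[str]:
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(labeled_texts):
    """Training: count character trigrams per language."""
    models = {}
    for text, lang in labeled_texts:
        models.setdefault(lang, Counter()).update(char_ngrams(text))
    return models


def predict(models, text):
    grams = char_ngrams(text)

    def score(counts):
        total = sum(counts.values())
        # Add-one smoothing; distinct trigrams + 1 approximates the vocabulary size.
        return sum(math.log((counts[g] + 1) / (total + len(counts) + 1)) for g in grams)

    return max(models, key=lambda lang: score(models[lang]))


def evaluate(models, labeled_texts) -> float:
    """Evaluation: accuracy on held-out examples."""
    correct = sum(predict(models, text) == lang for text, lang in labeled_texts)
    return correct / len(labeled_texts)


def detect(models, text, min_length: int = 3):
    """Production: validate input before calling the model; None means "cannot tell"."""
    text = text.strip()
    if len(text) < min_length:
        return None
    return predict(models, text)


# Hypothetical training and held-out data.
TRAIN = [
    ("the cat is in the house", "en"),
    ("the dog is in the garden", "en"),
    ("die katze ist in dem haus", "de"),
    ("der hund ist in dem garten", "de"),
]
HELD_OUT = [("the dog is in the house", "en"), ("der hund ist in dem haus", "de")]

models = train(TRAIN)
accuracy = evaluate(models, HELD_OUT)
print(accuracy)
print(detect(models, "the dog is in the house"))
```

Keeping the production wrapper separate from `predict` leaves room for the pre- and post-processing that a deployed system needs without touching the model itself.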

Thanks!

Questions?

Vsevolod Dyomkin
@vseloved