Charlie Greenbacker Berico Technologies - Meetupfiles.meetup.com/7616132/DC-NLP-2013-09...

Page 1: Charlie Greenbacker Berico Technologies - Meetupfiles.meetup.com/7616132/DC-NLP-2013-09 Charlie... · 2013-09-27 · NLP “Crash Course ... Text Analytics, etc. not Neuro-linguistic

NLP “Crash Course”

Charlie Greenbacker

Berico Technologies

Agenda

• Introduction & Motivation

• Famous Examples

• Basics

• Major Task Areas

• Protips

• Resources

Introduction & Motivation

About Me:

NLP since 2003? 2005? 2007?

PhD candidate (ABD) in NLP, “Leave of Absence” from UDel

Principal Data Scientist, Berico Technologies

Introduction & Motivation

By “NLP” we mean...

Natural Language Processing (#NLProc)

aka Computational Linguistics, Text Analytics, etc.

not Neuro-linguistic Programming! (#NLP)

Introduction & Motivation

Natural Language Processing is...

Using computers to process (i.e., analyze, understand, generate, etc.) natural human languages (e.g., English, Chinese, Klingon).

Hello, world! 你好,世界!

Introduction & Motivation

That sounds hard... why should I care?

• Most of the knowledge created by humans is unstructured text (information overload)

• Need some way to make sense of it all

• Enable quantitative analysis of text data

Famous Examples

Siri (Apple, SRI, Nuance): Speech Recognition/Generation

IBM Watson: Question Answering

Google Translate: Machine Translation

Basics

• Segmentation

• Part-of-speech tagging

• Noun phrase (NP) chunking

• Parsing

• Word sense disambiguation
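
As a rough illustration of the first item, segmentation, here is a minimal sketch in plain Python. The regular expressions and the function names `split_sentences` and `tokenize` are assumptions made for this example, not anything from the talk; a real toolkit handles abbreviations, URLs, and non-English text far better.

```python
import re

# Assumption: a sentence ends at ., ! or ? followed by whitespace and an
# uppercase letter. Abbreviations such as "Dr. Smith" will be split wrongly.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

# A token is a word (optionally with internal apostrophes or hyphens)
# or a single punctuation character.
TOKEN = re.compile(r"\w+(?:['-]\w+)*|[^\w\s]")


def split_sentences(text):
    """Split raw text into sentences using the naive boundary rule above."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


text = "NLP is hard. Why should I care? Most knowledge is unstructured text!"
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

The later steps in the list (tagging, chunking, parsing) typically consume token lists like these, which is why segmentation errors tend to propagate downstream.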

Basics

• Stop words, stemming/lemmatization

• Frequency analysis (terms, ngrams, TF-IDF)

• Machine learning (classification, clustering, recommendation)
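
The frequency-analysis bullet can be sketched with the standard library alone. This is one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency), chosen here as an assumption; the stop-word list is a tiny illustrative one and the helper names are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}


def terms(doc):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in doc.lower().split() if w not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenized = [terms(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(t for toks in tokenized for t in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        scores.append({
            t: (c / len(toks)) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return scores


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
scores = tf_idf(docs)
top_term = max(scores[0], key=scores[0].get)
```

In this variant a term that appears in every document scores zero, which is the point of the IDF factor: it discounts words that do not distinguish one document from another.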

Major Task Areas

Question Answering

• Match query with knowledge base

• Closed domain vs open domain

• Reasoning about intent of question

Major Task Areas

Speech Recognition

• Speech to text

• Trained/untrained user models

• Voice-based interfaces

Major Task Areas

Named Entity Recognition

• Entity extraction

• Persons, organizations, locations

• Grammar, syntax, phrasing

Major Task Areas

Entity Resolution

• Linking names to ground truth

• Disambiguating similar names
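
A very rough sketch of linking a mention to a known name, using only surface string similarity from Python's `difflib`. The knowledge base, the `resolve` helper, and the 0.8 cutoff are all assumptions for illustration; string similarity alone cannot disambiguate two entities that share a name, which usually needs context.

```python
import difflib


def resolve(mention, knowledge_base, cutoff=0.8):
    """Return the closest canonical name, or None if nothing is close enough."""
    # Compare case-insensitively but return the original spelling.
    by_key = {name.lower(): name for name in knowledge_base}
    matches = difflib.get_close_matches(
        mention.lower(), list(by_key), n=1, cutoff=cutoff
    )
    return by_key[matches[0]] if matches else None


# Hypothetical knowledge base of canonical entity names.
kb = ["Berico Technologies", "IBM", "Apple"]
linked = resolve("Berico Technologes", kb)   # misspelled mention
unlinked = resolve("Zzz", kb)                # no plausible match
```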

Major Task Areas

Co-reference Resolution

• Finding antecedents for pronouns

• Name resolution

Major Task Areas

Relationship Extraction

• Attribute values

• SVO triples

• Populating ontologies

Major Task Areas

Information Retrieval

• Query expansion

• Relevancy of results

• “More like this”
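
One simple way to approximate “more like this” is to rank documents by cosine similarity of their bag-of-words vectors. The sketch below assumes raw term counts and whitespace tokenization; the function names are made up for the example, and production search engines use weighted terms and inverted indexes instead.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: term -> raw count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def more_like_this(query, docs, k=2):
    """Return the k documents most similar to the query text."""
    q = vectorize(query)
    ranked = sorted(docs, key=lambda d: cosine(q, vectorize(d)), reverse=True)
    return ranked[:k]


docs = [
    "machine translation of idioms",
    "speech recognition for voice interfaces",
    "statistical machine translation systems",
]
similar = more_like_this("machine translation", docs, k=2)
```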

Major Task Areas

Assistive Technologies

• Text simplification

• Predictive text input

• Alternative interfaces
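
Predictive text input can be illustrated with a bigram model that suggests the words most often seen after the current one. This is a toy sketch under the assumption of a tiny whitespace-tokenized corpus; `train_bigrams` and `suggest` are hypothetical names, and real systems use much larger models with smoothing.

```python
from collections import Counter, defaultdict


def train_bigrams(corpus):
    """Count, for each word, which words follow it in the corpus."""
    model = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model


def suggest(model, word, k=3):
    """Return up to k most frequent next words after `word`."""
    followers = model.get(word.lower(), Counter())
    return [w for w, _ in followers.most_common(k)]


corpus = [
    "natural language processing",
    "natural language generation",
    "natural language processing is fun",
]
model = train_bigrams(corpus)
next_words = suggest(model, "language", k=1)
```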

Major Task Areas

NLG + Automatic Summarization

• Generating text from data

• Extractive summarization

• Abstractive summarization
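
Extractive summarization selects existing sentences rather than writing new ones. A minimal sketch, assuming a frequency heuristic: score each sentence by the average corpus frequency of its non-stop words and keep the top scorers in their original order. The stop-word list and `summarize` are assumptions for the example; abstractive summarization, which generates new text, is not shown.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption).
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in", "it", "on"}


def content_words(sentence):
    """Lowercased words of a sentence, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in document order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]


text = ("Parsing is hard. "
        "Parsing text is hard because language is ambiguous. "
        "Cats sleep.")
summary = summarize(text, n_sentences=1)
```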

Major Task Areas

Machine Translation

• From source to target, and back!

• Single terms work... sometimes

• Idioms, metaphors, cultural references

Major Task Areas

Sentiment Analysis

• Polarity, intensity, direction

• "Easy" for movie/product reviews

• "Impossible" for nearly anything else

Protips

• Domain adaptation (retrain your models, social media != news)

• Assume everything is in beta (error rates compound, translate last, consult the research literature)

• Evaluation is essential (human judges, “gold standard” data, cross-validation, appropriate metrics)
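
Two of these tips lend themselves to a small worked example: scoring system output against “gold standard” data, and seeing how error rates compound across pipeline stages. The sketch below assumes set-based precision, recall, and F1, and assumes stage errors are independent (a simplification); the entity sets are made-up illustrative data.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 against gold-standard items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def pipeline_accuracy(stage_accuracies):
    """End-to-end accuracy, assuming each stage's errors are independent."""
    total = 1.0
    for accuracy in stage_accuracies:
        total *= accuracy
    return total


# Made-up illustrative data: entities a system extracted vs. human labels.
predicted = {"Berico", "UDel", "Siri"}
gold = {"Berico", "UDel", "IBM", "Apple"}
p, r, f1 = precision_recall_f1(predicted, gold)

# Three stages (e.g. tokenizer, tagger, extractor), each 90% accurate.
end_to_end = pipeline_accuracy([0.9, 0.9, 0.9])
```

Under the independence assumption, three stages at 90% each give roughly 73% end to end, which is one reason to keep pipelines short and to place the noisiest step, such as translation, last.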

Resources (toolkits)

Stanford CoreNLP: Java, GPL

Apache OpenNLP: Java, Apache License

NLTK: Python, Apache License

Resources (books)

Natural Language Processing with Python: Bird, Klein, and Loper

Speech and Language Processing: Jurafsky and Martin

Foundations of Statistical Natural Language Processing: Manning and Schütze

Resources (groups)

ACL (Association for Computational Linguistics): Conferences, Workshops, Journals, SIGs

DC NLP: NLP Meetups

Data Community DC: NLP Workshops

Questions?

Charlie Greenbacker, Principal Data Scientist, Berico Technologies

@greenbacker